I am getting confused on how to design the architecure of crawler.
I have the search where I have
pagination: next page links to follow
a list of products on one page
individual links to be crawled to get the description
I have the following code:
def parse_page(self, response):
hxs = HtmlXPathSelector(response)
sites = hxs.select('//ol[#id=\'result-set\']/li')
items = []
for site in sites[:2]:
item = MyProduct()
item['product'] = myfilter(site.select('h2/a').select("string()").extract())
item['product_link'] = myfilter(site.select('dd[2]/').select("string()").extract())
if item['profile_link']:
request = Request(urljoin('http://www.example.com', item['product_link']),
callback = self.parseItemDescription)
request.meta['item'] = item
return request
soup = BeautifulSoup(response.body)
mylinks= soup.find_all("a", text="Next")
nextlink = mylinks[0].get('href')
yield Request(urljoin(response.url, nextlink), callback=self.parse_page)
The problem is that I have two return statements: one for request, and one for yield.
In the crawl spider, I don't need to use the last yield, so everything was working fine, but in BaseSpider I have to follow links manually.
What should I do?
As an initial pass (and based on your comment about wanting to do this yourself), I would suggest taking a look at the CrawlSpider code to get an idea of how to implement its functionality.
Related
I am new to scrapy and I want to do the following:
- I want to crawl a homepage and extract some specific listings
- with these listings I want to adjust the url and crawl the new web page
Crawling First URL
class Spider1:
start_urls = 'https://page1.org/'
def parse(self, response):
listings = response.css('get-listings-here').extract()
Crawling Second URL
class Spider2:
start_urls = 'https://page1.org/listings[output_of_Spider1]'
def parse(self, response):
final_data = response.css('get-needed_data').extract()
items['final'] = final_data
yield items
Maybe it is also possible within one spider, I am not sure. But what would be the best solution for it?
Thank you!
After extracting all links from your selector you need to yield Request to those links and add callback where you will receive HTML response
def parse(self,response):
yield Request(‘http://amazon.com/',callback=self.page)
def page(self,response):
## your new page html response
you can replace your extracted link with this amazon link.
Reference to the documentation scrapy Request
All the websites I want to parse are in the same domain but all look very different and contain different information I need.
My start_url is a page with a list containing all links I need. So in the parse() method I yield a request for each of these links and in parse_item_page I extract the first part of the information I need - which worked completely fine.
My problem is: I thought I could just do the same another time and for each link on my item_page call parse_entry. But I tried so many different versions of this and I just can't get it to work. They are the correct URLs but scrapy seems to just don't want to call a third parse() function, nothing in there ever gets executed.
How can I get scrapy to use parse_entry, or pass all these links to a new spider?
This is a simplified, shorter version of my spider class:
def parse(self, response, **kwargs):
for href in response.xpath("//listItem/#href"):
url = response.urljoin(href.extract())
yield scrapy.Request(url, callback=self.parse_item_page)
def parse_item_page(self, response):
for sel in response.xpath("//div"):
item = items.FirstItem()
item['attribute'] = sel.xpath("//h1/text()").get().strip()
for href in response.xpath("//entry/#href"):
yield response.follow(href.extract(), callback=self.parse_entry)
yield item
def parse_entry(self, response):
for sel in response.xpath("//textBlock"):
item = items.SecondItem()
item['attribute'] = sel.xpath("//h1/text()").get().strip()
yield item
I am scraping data using Scrapy in a item.json file. Data is getting stored but the problem is only 25 entries are stored, while in the website there are more entries. I am using the following command:
class DmozSpider(Spider):
name = "dmoz"
allowed_domains = ["justdial.com"]
start_urls = ["http://www.justdial.com/Delhi-NCR/Taxi-Services/ct-57371"]
def parse(self, response):
hxs = Selector(response)
sites = hxs.xpath('//section[#class="rslwrp"]/section')
items = []
for site in sites:
item = DmozItem()
item['title'] = site.xpath('section[2]/section[1]/aside[1]/p[1]/span/a/text()').extract()
items.append(item)
return items
The command I'm using to run the script is:
scrapy crawl myspider -o items.json -t json
Is there any setting which I am not aware of? or the page is not getting loaded fully till scraping. how do i resolve this?
Abhi, here is some code, but please note that it isn't complete and working, it is just to show you the idea. Usually you have to find a next page URL and try to recreate the appropriate request in your spider. In your case AJAX is used. I used FireBug to check which requests are sent by the site.
URL = "http://www.justdial.com/function/ajxsearch.php?national_search=0&...page=%s" # this isn't the complete next page URL
next_page = 2 # how to handle next_page counter is up to you
def parse(self, response):
hxs = Selector(response)
sites = hxs.xpath('//section[#class="rslwrp"]/section')
for site in sites:
item = DmozItem()
item['title'] = site.xpath('section[2]/section[1]/aside[1]/p[1]/span/a/text()').extract()
yield item
# build you pagination URL and send a request
url = self.URL % self.next_page
yield Request(url) # Request is Scrapy request object here
# increment next_page counter if required, make additional
# checks and actions etc
Hope this will help.
I'm trying to scrap an e-commerce web site, and I'm doing it in 2 steps.
This website has a structure like this:
The homepage has the links to the family-items and subfamily-items pages
Each family & subfamily page has a list of products paginated
Right now I have 2 spiders:
GeneralSpider to get the homepage links and store them
ItemSpider to get elements from each page
I'm completely new to Scrapy, I'm following some tutorials to achieve this. I'm wondering how complex can be the parse functions and how rules works. My spiders right now looks like:
GeneralSpider:
class GeneralSpider(CrawlSpider):
name = 'domain'
allowed_domains = ['domain.org']
start_urls = ['http://www.domain.org/home']
def parse(self, response):
links = LinksItem()
links['content'] = response.xpath("//div[#id='h45F23']").extract()
return links
ItemSpider:
class GeneralSpider(CrawlSpider):
name = 'domain'
allowed_domains = ['domain.org']
f = open("urls.txt")
start_urls = [url.strip() for url in f.readlines()]
# Each URL in the file has pagination if it has more than 30 elements
# I don't know how to paginate over each URL
f.close()
def parse(self, response):
item = ShopItem()
item['name'] = response.xpath("//h1[#id='u_name']").extract()
item['description'] = response.xpath("//h3[#id='desc_item']").extract()
item['prize'] = response.xpath("//div[#id='price_eur']").extract()
return item
Wich is the best way to make the spider follow the pagination of an url ?
If the pagination is JQuery, meaning there is no GET variable in the URL, Would be possible to follow the pagination ?
Can I have different "rules" in the same spider to scrap different parts of the page ? or is better to have the spiders specialized, each spider focused in one thing?
I've also googled looking for any book related with Scrapy, but it seems there isn't any finished book yet, or at least I couldn't find one.
Does anyone know if some Scrapy book that will be released soon ?
Edit:
This 2 URL's fits for this example. In the Eroski Home page you can get the URL's to the products page.
In the products page you have a list of items paginated (Eroski Items):
URL to get Links: Eroski Home
URL to get Items: Eroski Fruits
In the Eroski Fruits page, the pagination of the items seems to be JQuery/AJAX, because more items are shown when you scroll down, is there a way to get all this items with Scrapy ?
Which is the best way to make the spider follow the pagination of an url ?
This is very site-specific and depends on how the pagination is implemented.
If the pagination is JQuery, meaning there is no GET variable in the URL, Would be possible to follow the pagination ?
This is exactly your use case - the pagination is made via additional AJAX calls that you can simulate inside your Scrapy spider.
Can I have different "rules" in the same spider to scrape different parts of the page ? or is better to have the spiders specialized, each spider focused in one thing?
Yes, the "rules" mechanism that a CrawlSpider provides is a very powerful piece of technology - it is highly configurable - you can have multiple rules, some of them would follow specific links that match specific criteria, or located in a specific section of a page. Having a single spider with multiple rules should be preferred comparing to having multiple spiders.
Speaking about your specific use-case, here is the idea:
make a rule to follow categories and subcategories in the navigation menu of the home page - this is there restrict_xpaths would help
in the callback, for every category or subcategory yield a Request that would mimic the AJAX request sent by your browser when you open a category page
in the AJAX response handler (callback) parse the available items and yield an another Request for the same category/subcategory but increasing the page GET parameter (getting next page)
Example working implementation:
import re
import urllib
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
class ProductItem(scrapy.Item):
description = scrapy.Field()
price = scrapy.Field()
class GrupoeroskiSpider(CrawlSpider):
name = 'grupoeroski'
allowed_domains = ['compraonline.grupoeroski.com']
start_urls = ['http://www.compraonline.grupoeroski.com/supermercado/home.jsp']
rules = [
Rule(LinkExtractor(restrict_xpaths='//div[#class="navmenu"]'), callback='parse_categories')
]
def parse_categories(self, response):
pattern = re.compile(r'/(\d+)\-\w+')
groups = pattern.findall(response.url)
params = {'page': 1, 'categoria': groups.pop(0)}
if groups:
params['grupo'] = groups.pop(0)
if groups:
params['familia'] = groups.pop(0)
url = 'http://www.compraonline.grupoeroski.com/supermercado/ajax/listProducts.jsp?' + urllib.urlencode(params)
yield scrapy.Request(url,
meta={'params': params},
callback=self.parse_products,
headers={'X-Requested-With': 'XMLHttpRequest'})
def parse_products(self, response):
for product in response.xpath('//div[#class="product_element"]'):
item = ProductItem()
item['description'] = product.xpath('.//span[#class="description_1"]/text()').extract()[0]
item['price'] = product.xpath('.//div[#class="precio_line"]/p/text()').extract()[0]
yield item
params = response.meta['params']
params['page'] += 1
url = 'http://www.compraonline.grupoeroski.com/supermercado/ajax/listProducts.jsp?' + urllib.urlencode(params)
yield scrapy.Request(url,
meta={'params': params},
callback=self.parse_products,
headers={'X-Requested-With': 'XMLHttpRequest'})
Hope this is a good starting point for you.
Does anyone know if some Scrapy book that will be released soon?
Nothing specific that I can recall.
Though I heard that some publisher has some plans to may be release a book about web-scraping, but I'm not supposed to tell you that.
After several readings to Scrapy docs I'm still not catching the diferrence between using CrawlSpider rules and implementing my own link extraction mechanism on the callback method.
I'm about to write a new web crawler using the latter approach, but just becuase I had a bad experience in a past project using rules. I'd really like to know exactly what I'm doing and why.
Anyone familiar with this tool?
Thanks for your help!
CrawlSpider inherits BaseSpider. It just added rules to extract and follow links.
If these rules are not enough flexible for you - use BaseSpider:
class USpider(BaseSpider):
"""my spider. """
start_urls = ['http://www.amazon.com/s/?url=search-alias%3Dapparel&sort=relevance-fs-browse-rank']
allowed_domains = ['amazon.com']
def parse(self, response):
'''Parse main category search page and extract subcategory search link.'''
self.log('Downloaded category search page.', log.DEBUG)
if response.meta['depth'] > 5:
self.log('Categories depth limit reached (recursive links?). Stopping further following.', log.WARNING)
hxs = HtmlXPathSelector(response)
subcategories = hxs.select("//div[#id='refinements']/*[starts-with(.,'Department')]/following-sibling::ul[1]/li/a[span[#class='refinementLink']]/#href").extract()
for subcategory in subcategories:
subcategorySearchLink = urlparse.urljoin(response.url, subcategorySearchLink)
yield Request(subcategorySearchLink, callback = self.parseSubcategory)
def parseSubcategory(self, response):
'''Parse subcategory search page and extract item links.'''
hxs = HtmlXPathSelector(response)
for itemLink in hxs.select('//a[#class="title"]/#href').extract():
itemLink = urlparse.urljoin(response.url, itemLink)
self.log('Requesting item page: ' + itemLink, log.DEBUG)
yield Request(itemLink, callback = self.parseItem)
try:
nextPageLink = hxs.select("//a[#id='pagnNextLink']/#href").extract()[0]
nextPageLink = urlparse.urljoin(response.url, nextPageLink)
self.log('\nGoing to next search page: ' + nextPageLink + '\n', log.DEBUG)
yield Request(nextPageLink, callback = self.parseSubcategory)
except:
self.log('Whole category parsed: ' + categoryPath, log.DEBUG)
def parseItem(self, response):
'''Parse item page and extract product info.'''
hxs = HtmlXPathSelector(response)
item = UItem()
item['brand'] = self.extractText("//div[#class='buying']/span[1]/a[1]", hxs)
item['title'] = self.extractText("//span[#id='btAsinTitle']", hxs)
...
Even if BaseSpider's start_urls are not enough flexible for you, override start_requests method.
If you want selective crawling, like fetching "Next" links for pagination etc., it's better to write your own crawler. But for general crawling, you should use crawlspider and filter out the links that you don't need to follow using Rules & process_links function.
Take a look at the crawlspider code in \scrapy\contrib\spiders\crawl.py , it isn't too complicated.