I have written a scrapy crawlspider to crawl a site with a structure like category page > type page > list page > item page. On the category page there are many categories of machines each of which has a type page with lots of types, each of the different types has a list of items, then finally each machine has a page with info about it.
My spider has a rule to get from the home page to the category page where I define the callback parsecatpage, this generates an item, grabs the category and yields a new request for each category on the page. I pass the item and the category name with request.meta and specify the callback is parsetype page.
Parsetypepage gets the item from response.meta then yields requests for each type and passes the item, and the concatenation of category and type along with it in the request.meta. The callback is parsemachinelist.
Parsemachinelist gets the item from response.meta then yields requests for each item on the list and passes the item, category/type, description via request.meta to the final callback, parsemachine. This gets the meta attributes and populates all the fields in the item using the info on the page and the info that was passed from the previous pages and finally yields an item.
If I limit this to a single category and type (with for example contains[#href, "filter=c:Grinders"] and contains[#href, "filter=t:Disc+-+Horizontal%2C+Single+End"]) then it works and there is a machine item for each machine on the final page. The problem is that once I allow the spider to scrapy all the categories and all the types it only returns scrapy items for the machines on the first of the final pages it gets to and once it has done that the spider is finished and doesn't get the other categories etc.
Here is the (anonymous) code
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.http import Request
from myspider.items import MachineItem
import urlparse
class MachineSpider(CrawlSpider):
name = 'myspider'
allowed_domains = ['example.com']
start_urls = ['http://www.example.com/index.php']
rules = (
Rule(SgmlLinkExtractor(allow_domains=('example.com'),allow=('12\.html'),unique=True),callback='parsecatpage'),
)
def parsecatpage(self, response):
hxs = HtmlXPathSelector(response)
#this works, next line doesn't categories = hxs.select('//a[contains(#href, "filter=c:Grinders")]')
categories = hxs.select('//a[contains(#href, "filter=c:Grinders") or contains(#href, "filter=c:Lathes")]')
for cat in categories:
item = MachineItem()
req = Request(urlparse.urljoin(response.url,''.join(cat.select("#href").extract()).strip()),callback=self.parsetypepage)
req.meta['item'] = item
req.meta['machinecategory'] = ''.join(cat.select("./text()").extract())
yield req
def parsetypepage(self, response):
hxs = HtmlXPathSelector(response)
#this works, next line doesn't types = hxs.select('//a[contains(#href, "filter=t:Disc+-+Horizontal%2C+Single+End")]')
types = hxs.select('//a[contains(#href, "filter=t:Disc+-+Horizontal%2C+Single+End") or contains(#href, "filter=t:Lathe%2C+Production")]')
for typ in types:
item = response.meta['item']
req = Request(urlparse.urljoin(response.url,''.join(typ.select("#href").extract()).strip()),callback=self.parsemachinelist)
req.meta['item'] = item
req.meta['machinecategory'] = ': '.join([response.meta['machinecategory'],''.join(typ.select("./text()").extract())])
yield req
def parsemachinelist(self, response):
hxs = HtmlXPathSelector(response)
for row in hxs.select('//tr[contains(td/a/#href, "action=searchdet")]'):
item = response.meta['item']
req = Request(urlparse.urljoin(response.url,''.join(row.select('./td/a[contains(#href,"action=searchdet")]/#href').extract()).strip()),callback=self.parsemachine)
print urlparse.urljoin(response.url,''.join(row.select('./td/a[contains(#href,"action=searchdet")]/#href').extract()).strip())
req.meta['item'] = item
req.meta['descr'] = row.select('./td/div/text()').extract()
req.meta['machinecategory'] = response.meta['machinecategory']
yield req
def parsemachine(self, response):
hxs = HtmlXPathSelector(response)
item = response.meta['item']
item['machinecategory'] = response.meta['machinecategory']
item['comp_name'] = 'Name'
item['description'] = response.meta['descr']
item['makemodel'] = ' '.join([''.join(hxs.select('//table/tr[contains(td/strong/text(), "Make")]/td/text()').extract()),''.join(hxs.select('//table/tr[contains(td/strong/text(), "Model")]/td/text()').extract())])
item['capacity'] = hxs.select('//tr[contains(td/strong/text(), "Capacity")]/td/text()').extract()
relative_image_url = hxs.select('//img[contains(#src, "custom/modules/images")]/#src')[0].extract()
abs_image_url = urlparse.urljoin(response.url, relative_image_url.strip())
item['image_urls'] = [abs_image_url]
yield item
SPIDER = MachineSpider()
So for example the spider will find Grinders on the category page and go to the Grinder type page where it will find the Disc Horizontal Single End type, then it will go to that page and find the list of machines and go to each machines page and finally there will be an item for each machine. If you try and go to Grinders and Lathes though it will run through the Grinders fine then it will crawl the Lathes and Lathes type pages and stop there without generating the requests for the Lathes list page and the final Lathes pages.
Can anyone help with this? Why isn't the spider getting to the second (or third etc.) machine list page once there is more than one category of machine?
Sorry for the epic post, just trying to explain the problem!!
Thanks!!
You should print the url of the request, to be sure it's ok. Also you can try this version:
def parsecatpage(self, response):
hxs = HtmlXPathSelector(response)
categories = hxs.select('//a[contains(#href, "filter=c:Grinders") or contains(#href, "filter=c:Lathes")]')
for cat in categories:
item = MachineItem()
cat_url = urlparse.urljoin(response.url, cat.select("./#href").extract()[0])
print 'url:', cat_url # to see what's there
cat_name = cat.select("./text()").extract()[0]
req = Request(cat_url, callback=self.parsetypepage, meta={'item': item, 'machinecategory': cat_name})
yield req
The problem was that the website is set up so that moving from the category to type page (and the following pages) occurs via filtering the results that are shown. This means that if the requests are performed depth first to the bottom of the query then it works (i.e. choose a category, then get all the types of that category, then get all the machines in each type then scrape the page of each machine) but if a request for the next type page is processed before the spider has got the urls for each machine in the first type then the urls are no longer correct and the spider reaches an incorrect page and cannot extract the info for the next step.
To solve the problem I defined a category setup callback which is called the first time only and gets a list of all categories called categories, then a category callback which is called from category setup which starts the crawl with a single category only using categories.pop(). Once the spider has got to the bottom of the nested callbacks and scraped all the machines in the list there is a callback back up to the category callback again (needed dont_follow=True in the Request) where categories.pop() starts the process again with the next category in the list until they are all done. This way each category is treated fully before the next is started and it works.
Thanks for your final comment, that got me thinking along the right lines and led me to the solution!
Related
I'm trying to scrape all the data from a website called quotestoscrape. But, When I try to run my code it's only getting the one random quote. It should take at least all the data from that page only but it's only taking one. Also, if somehow I get the data from page 1 now what I want is to get the data from all the pages.
So how do I solve this error(which should take all the data from the page1)?
How do I take all the data which is present on the next pages?
items.py file
import scrapy
class QuotetutorialItem(scrapy.Item):
title = scrapy.Field()
author = scrapy.Field()
tag = scrapy.Field()
quotes_spider.py file
import scrapy
from ..items import QuotetutorialItem
class QuoteScrapy(scrapy.Spider):
name = 'quotes'
start_urls = [
'http://quotes.toscrape.com/'
]
def parse(self, response):
items = QuotetutorialItem()
all_div_quotes = response.css('div.quote')
for quotes in all_div_quotes:
title = quotes.css('span.text::text').extract()
author = quotes.css('.author::text').extract()
tag = quotes.css('.tag::text').extract()
items['title'] = title
items['author'] = author
items['tag'] = tag
yield items
Please tell me what change I can do?
As reported, it's missing an ident level on your yield. And to follow next pages, just add a check for the next button, and yield a request following it.
import scrapy
class QuoteScrapy(scrapy.Spider):
name = 'quotes'
start_urls = [
'http://quotes.toscrape.com/'
]
def parse(self, response):
items = {}
all_div_quotes = response.css('div.quote')
for quotes in all_div_quotes:
title = quotes.css('span.text::text').extract()
author = quotes.css('.author::text').extract()
tag = quotes.css('.tag::text').extract()
items['title'] = title
items['author'] = author
items['tag'] = tag
yield items
next_page = response.css('li.next a::attr(href)').extract_first()
if next_page:
yield response.follow(next_page)
As #LanteDellaRovere has correctly identified in a comment, the yield statement should be executed for each iteration of the for loop - which is why you are only seeing a single (presumably the last) link from each page.
As far as reading the continued pages, you could extract it from the <nav> element at the bottom of the page, but the structure is very simple - the links (when no tag is specified) are of the form
http://quotes.toscrape.com/page/N/
You will find that for N=1 you get the first page. So just access the URLs for increasing values of N until the attempt sees a 404 return should work as a simplistic solution.
Not knowing much about Scrapy I can't give you exact code, but the examples at https://docs.scrapy.org/en/latest/intro/tutorial.html#following-links are fairly helpful if you want a more sophisticated and Pythonic approach.
I have a website and I would like to find a webpage with information about job vacancies. There is only one page usually with such information. So I start crawling with website and I manage to get all webpages up to certain depth. It works. But they are many times duplicated. Instead of lets say 45 pages I get 1000 pages. I know the reason why. The reason is that every time I call my "parse" function, it parses all the webpages on a certain webpage. So when I come to a new webpage, it crawls all webpages, out of which some have been crawled before.
1) I tried to make "items=[]" list out of parse function but I get some global error. I don't know how to get a list of unique webpages. When I have one, I will be able to choose the right one with simple url parsing.
2) I also tried to have "Request" and "return items" in the "parse" function, but I get syntax error: return inside generator.
I am using DEPTH_LIMIT. Do I really need to use Rules ?
code:
import scrapy, urlparse, os
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor
from tutorial.items import JobItem
from scrapy.utils.response import get_base_url
from scrapy.http import Request
from urlparse import urljoin
from datetime import datetime
class JobSpider(scrapy.Spider):
name = "jobs"
allowed_domains = ["www.gen-i.si"]
start_urls = ["http://www.gen-i.si"]
def parse(self, response):
response.selector.remove_namespaces() #
urls = response.xpath('//#href').extract()#choose all "href", either new websites either webpages on our website
items = []
base_url = get_base_url(response) #base url
for url in urls:
#we need only webpages, so we remove all websites and urls with strange characters
if (url[0:4] != "http") and not any(x in url for x in ['%', ':', '?', '&']):
item = JobItem()
absolute_url = urlparse.urljoin(base_url,url)
item["link"] = absolute_url
if item not in items:
items.append(item)
yield item
yield Request(absolute_url, callback = self.parse)
#return items
You're appending item (a newly instantiated object), to your list items. Since item is always a new JobItem() object, it will never exist in your list items.
To illustrate:
>>> class MyItem(object):
... pass
...
>>> a = MyItem()
>>> b = MyItem()
>>> a.url = "abc"
>>> b.url = "abc"
>>> a == b
False
Just because they have one attribute that is the same, doesn't mean they are the same object.
Even if this worked though, you're resetting the list items everytime you call parse (ie. for each request), so you'll never really remove duplicates.
Instead, you would be better checking vs. the absolute_url itself, and putting the list at the spider level:
class JobSpider(scrapy.Spider):
name = "jobs"
allowed_domains = ["www.gen-i.si"]
start_urls = ["http://www.gen-i.si"]
all_urls = []
def parse(self, response):
# remove "items = []"
...
for url in urls:
if (url[0:4] != "http") and not any(x in url for x in ['%', ':', '?', '&']):
absolute_url = urlparse.urljoin(base_url, url)
if absolute_url not in self.all_urls:
self.all_urls.append(absolute_url)
item = JobItem()
item['link'] = absolute_url
yield item
yield Request(absolute_url, callback = self.parse)
This functionality, however, would be better served by creating a Dupefilter instead (see here for more information). Additionally, I agree with #RodrigoNey, a CrawlSpider would likely better serve your purpose, and be more maintainable in the long run.
I'm working on a web crawler and ended up making a list of links that needed to be crawled, then once we went there it was deleted from that list and added to the crawled list. then you can use a not in search to either add/delete/etc.
I'm trying to scrap an e-commerce web site, and I'm doing it in 2 steps.
This website has a structure like this:
The homepage has the links to the family-items and subfamily-items pages
Each family & subfamily page has a list of products paginated
Right now I have 2 spiders:
GeneralSpider to get the homepage links and store them
ItemSpider to get elements from each page
I'm completely new to Scrapy, I'm following some tutorials to achieve this. I'm wondering how complex can be the parse functions and how rules works. My spiders right now looks like:
GeneralSpider:
class GeneralSpider(CrawlSpider):
name = 'domain'
allowed_domains = ['domain.org']
start_urls = ['http://www.domain.org/home']
def parse(self, response):
links = LinksItem()
links['content'] = response.xpath("//div[#id='h45F23']").extract()
return links
ItemSpider:
class GeneralSpider(CrawlSpider):
name = 'domain'
allowed_domains = ['domain.org']
f = open("urls.txt")
start_urls = [url.strip() for url in f.readlines()]
# Each URL in the file has pagination if it has more than 30 elements
# I don't know how to paginate over each URL
f.close()
def parse(self, response):
item = ShopItem()
item['name'] = response.xpath("//h1[#id='u_name']").extract()
item['description'] = response.xpath("//h3[#id='desc_item']").extract()
item['prize'] = response.xpath("//div[#id='price_eur']").extract()
return item
Wich is the best way to make the spider follow the pagination of an url ?
If the pagination is JQuery, meaning there is no GET variable in the URL, Would be possible to follow the pagination ?
Can I have different "rules" in the same spider to scrap different parts of the page ? or is better to have the spiders specialized, each spider focused in one thing?
I've also googled looking for any book related with Scrapy, but it seems there isn't any finished book yet, or at least I couldn't find one.
Does anyone know if some Scrapy book that will be released soon ?
Edit:
This 2 URL's fits for this example. In the Eroski Home page you can get the URL's to the products page.
In the products page you have a list of items paginated (Eroski Items):
URL to get Links: Eroski Home
URL to get Items: Eroski Fruits
In the Eroski Fruits page, the pagination of the items seems to be JQuery/AJAX, because more items are shown when you scroll down, is there a way to get all this items with Scrapy ?
Which is the best way to make the spider follow the pagination of an url ?
This is very site-specific and depends on how the pagination is implemented.
If the pagination is JQuery, meaning there is no GET variable in the URL, Would be possible to follow the pagination ?
This is exactly your use case - the pagination is made via additional AJAX calls that you can simulate inside your Scrapy spider.
Can I have different "rules" in the same spider to scrape different parts of the page ? or is better to have the spiders specialized, each spider focused in one thing?
Yes, the "rules" mechanism that a CrawlSpider provides is a very powerful piece of technology - it is highly configurable - you can have multiple rules, some of them would follow specific links that match specific criteria, or located in a specific section of a page. Having a single spider with multiple rules should be preferred comparing to having multiple spiders.
Speaking about your specific use-case, here is the idea:
make a rule to follow categories and subcategories in the navigation menu of the home page - this is there restrict_xpaths would help
in the callback, for every category or subcategory yield a Request that would mimic the AJAX request sent by your browser when you open a category page
in the AJAX response handler (callback) parse the available items and yield an another Request for the same category/subcategory but increasing the page GET parameter (getting next page)
Example working implementation:
import re
import urllib
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
class ProductItem(scrapy.Item):
description = scrapy.Field()
price = scrapy.Field()
class GrupoeroskiSpider(CrawlSpider):
name = 'grupoeroski'
allowed_domains = ['compraonline.grupoeroski.com']
start_urls = ['http://www.compraonline.grupoeroski.com/supermercado/home.jsp']
rules = [
Rule(LinkExtractor(restrict_xpaths='//div[#class="navmenu"]'), callback='parse_categories')
]
def parse_categories(self, response):
pattern = re.compile(r'/(\d+)\-\w+')
groups = pattern.findall(response.url)
params = {'page': 1, 'categoria': groups.pop(0)}
if groups:
params['grupo'] = groups.pop(0)
if groups:
params['familia'] = groups.pop(0)
url = 'http://www.compraonline.grupoeroski.com/supermercado/ajax/listProducts.jsp?' + urllib.urlencode(params)
yield scrapy.Request(url,
meta={'params': params},
callback=self.parse_products,
headers={'X-Requested-With': 'XMLHttpRequest'})
def parse_products(self, response):
for product in response.xpath('//div[#class="product_element"]'):
item = ProductItem()
item['description'] = product.xpath('.//span[#class="description_1"]/text()').extract()[0]
item['price'] = product.xpath('.//div[#class="precio_line"]/p/text()').extract()[0]
yield item
params = response.meta['params']
params['page'] += 1
url = 'http://www.compraonline.grupoeroski.com/supermercado/ajax/listProducts.jsp?' + urllib.urlencode(params)
yield scrapy.Request(url,
meta={'params': params},
callback=self.parse_products,
headers={'X-Requested-With': 'XMLHttpRequest'})
Hope this is a good starting point for you.
Does anyone know if some Scrapy book that will be released soon?
Nothing specific that I can recall.
Though I heard that some publisher has some plans to may be release a book about web-scraping, but I'm not supposed to tell you that.
I am getting confused on how to design the architecure of crawler.
I have the search where I have
pagination: next page links to follow
a list of products on one page
individual links to be crawled to get the description
I have the following code:
def parse_page(self, response):
hxs = HtmlXPathSelector(response)
sites = hxs.select('//ol[#id=\'result-set\']/li')
items = []
for site in sites[:2]:
item = MyProduct()
item['product'] = myfilter(site.select('h2/a').select("string()").extract())
item['product_link'] = myfilter(site.select('dd[2]/').select("string()").extract())
if item['profile_link']:
request = Request(urljoin('http://www.example.com', item['product_link']),
callback = self.parseItemDescription)
request.meta['item'] = item
return request
soup = BeautifulSoup(response.body)
mylinks= soup.find_all("a", text="Next")
nextlink = mylinks[0].get('href')
yield Request(urljoin(response.url, nextlink), callback=self.parse_page)
The problem is that I have two return statements: one for request, and one for yield.
In the crawl spider, I don't need to use the last yield, so everything was working fine, but in BaseSpider I have to follow links manually.
What should I do?
As an initial pass (and based on your comment about wanting to do this yourself), I would suggest taking a look at the CrawlSpider code to get an idea of how to implement its functionality.
I am scraping the job sites where the first page ahs the links to all the jobs.
Now i am storing the title , job , company from the first page.
But i also want to store the description , which is available by clicking on the job title. I want to store that as well with the current items.
This is my curent code
def parse(self, response):
hxs = HtmlXPathSelector(response)
sites = hxs.select("//div[#class='jobenteries']")
items = []
for site in sites[:3]:
print "Hello"
item = DmozItem()
item['title'] = site.select('a/text()').extract()
item['desc'] = ''
items.append(item)
return items
But that description is on the next page link. how can i do that
From the first page, return Requests for the second page and pass the data for each item in the request.meta dict. On the callback method for the second page you can read the data you passed and return the fully populated item.
See Passing additional data to callback functions in the scrapy docs for more details and an example.