handle redirected response with proper parser - python

I am crawling a site with Scrapy. The parse method first extracts all the category links and then dispatches a request for each, with parse_category as the callback.
The problem is that if a category has only one product, the site redirects to that product's page, and my parse_category fails to recognize this page.
How do I parse that redirected category page with the product page parser?
Here is an example.
parse finds 3 category pages:
http://example.com/products/samsung
http://example.com/products/dell
http://example.com/products/apple
parse_category is called for all of those pages. Each returns an HTML page with a list of products. But apple has a single product, the iMac 27", so it redirects to http://example.com/products/apple/imac_27. This is a product page, and the category parser fails to parse it.
In this scenario the product parse method, parse_product, should be called. How do I do that?
I could add some logic to parse_category and call parse_product from there, but I don't want that. I want Scrapy to do it. I am happy to supply URL patterns or any other information it needs.
Here is the code.
class ExampleSpider(BaseSpider):
    name = u'example.com'
    allowed_domains = [u'www.example.com']
    start_urls = [u'http://www.example.com/category.aspx']
    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        anchors = hxs.select('/xpath')
        for anchor in anchors:
            yield Request(urljoin(get_base_url(response), anchor), callback=self.parse_category)
    def parse_category(self, response):
        hxs = HtmlXPathSelector(response)
        products = hxs.select(products_xpath).extract()
        for url in products:
            yield Request(url, callback=self.parse_product)
    def parse_product(self, response):
        # product parsing ...
        pass

You can opt to write a middleware which implements the process_response method. Whenever your response is for a product URL instead of a category, create a copy of the Request object and change the callback function to your product parser.
In the end, return the new Request object from the middleware. Note: You might need to set dont_filter to True for the new Request to ensure the DupeFilter doesn't filter the Request.
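A minimal sketch of such a downloader middleware, under stated assumptions: product URLs have one more path segment than category URLs (the hypothetical PRODUCT_URL pattern below is based only on the example URLs in the question), the spider exposes parse_category and parse_product, and the class name is made up. The redirected request still carries the category callback, so the check looks at that callback together with the final response URL.

```python
import re

# Hypothetical pattern: /products/<category>/<product>, as in the question's example URLs.
PRODUCT_URL = re.compile(r"/products/[^/]+/[^/]+/?$")


class ProductRedirectMiddleware:
    """Sketch: reroute product pages that arrived through a category request."""

    def process_response(self, request, response, spider):
        category_cb = getattr(spider, "parse_category", None)
        if (
            category_cb is not None
            and request.callback == category_cb
            and PRODUCT_URL.search(response.url)
        ):
            # Re-issue the request with the product callback. dont_filter keeps
            # the duplicate filter from dropping the already-seen URL.
            return request.replace(
                url=response.url,
                callback=spider.parse_product,
                dont_filter=True,
            )
        # Anything else passes through unchanged.
        return response
```

The class still has to be enabled through the DOWNLOADER_MIDDLEWARES setting of the project. One trade-off of this approach is that the product page is downloaded a second time, because returning a Request from process_response reschedules it.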

Related

Scrapy: How To Start Scraping Data From a Search Result that uses Javascript

I am new to Scrapy and Python.
I want to scrape data from a search result. Loading the page shows the default content, but what I need to scrape is the filtered content, while also following the pagination.
Here's the URL:
https://teslamotorsclub.com/tmc/post-ratings/6/posts
I need to scrape the items from the Time Filter: "Today" result.
I tried different approaches but none of them works.
This is what I have so far, but it is only the layout structure.
class TmcnfSpider(scrapy.Spider):
    name = 'tmcnf'
    allowed_domains = ['teslamotorsclub.com']
    start_urls = ['https://teslamotorsclub.com/tmc/post-ratings/6/posts']
    def start_requests(self):
        # Show form from a filtered search result
    def parse(self, response):
        # some code scraping item
        # Yield url for pagination
To get the posts for the "Today" filter, you need to send a POST request to https://teslamotorsclub.com/tmc/post-ratings/6/posts along with a payload. The following should fetch the results you are interested in.
import scrapy
class TmcnfSpider(scrapy.Spider):
    name = "teslamotorsclub"
    start_urls = ["https://teslamotorsclub.com/tmc/post-ratings/6/posts"]
    def parse(self, response):
        payload = {'time_chooser': '4', '_xfToken': ''}
        yield scrapy.FormRequest(response.url, formdata=payload, callback=self.parse_results)
    def parse_results(self, response):
        for title in response.css("h3.title > a::text").getall():
            yield {"title": title.strip()}
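For reference, FormRequest sends formdata as a URL-encoded form body and defaults to POST when formdata is given. A small standard-library sketch of what the payload above looks like once encoded; it only illustrates the encoding and does not contact the site.

```python
from urllib.parse import urlencode

# Same payload as in the answer; '4' is the value the answer uses for the "Today" filter.
payload = {"time_chooser": "4", "_xfToken": ""}

# An empty value is still sent, as a key followed by '=' and nothing else.
body = urlencode(payload)
print(body)
```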

How to scrape 2 web page with same domain on scrapy using python?

Hi guys, I am very new to scraping data and have only tried the basics. My problem is that I have 2 web pages on the same domain that I need to scrape.
My logic is:
First page: www.sample.com/view-all.html
*This page lists all the items, and I need to get the href attribute of every item.
Second page: www.sample.com/productpage.52689.html
*This link comes from the first page, so 52689 needs to change dynamically depending on the link provided by the first page.
I need to get all the data, like title, description, etc., from the second page.
I was thinking of a for loop, but it's not working on my end. I searched on Google but could not find anyone with the same problem. Please help me.
import scrapy
class SalesItemSpider(scrapy.Spider):
    name = 'sales_item'
    allowed_domains = ['www.sample.com']
    start_urls = ['www.sample.com/view-all.html', 'www.sample.com/productpage.00001.html']
    def parse(self, response):
        for product_item in response.css('li.product-item'):
            item = {
                'URL': product_item.css('a::attr(href)').extract_first(),
            }
            yield item
Inside parse you can yield a Request() with the URL and the name of the function that should scrape that URL:
def parse(self, response):
    for product_item in response.css('li.product-item'):
        url = product_item.css('a::attr(href)').extract_first()
        # it will send `www.sample.com/productpage.52689.html` to `parse_subpage`
        yield scrapy.Request(url=url, callback=self.parse_subpage)
def parse_subpage(self, response):
    # here you parse from www.sample.com/productpage.52689.html
    item = {
        'title': ...,
        'description': ...
    }
    yield item
Look for Request in the Scrapy documentation and its tutorial.
There is also
response.follow(url, callback=self.parse_subpage)
which resolves relative URLs against the current page, so you don't have to prepend the domain yourself as in
Request(url="http://www.sample.com/" + url, callback=self.parse_subpage)
See "A shortcut for creating Requests" in the tutorial.
If you are interested in scraping, you should read docs.scrapy.org from the first page to the last.
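response.follow resolves relative links against the URL of the page they came from. The standard library's urllib.parse.urljoin performs the same kind of resolution, so a short sketch with it shows what to expect. The page URL and hrefs are the placeholder ones from the question, with an http:// scheme assumed; Scrapy rejects request URLs without a scheme, so the start_urls entries in the question would need one as well.

```python
from urllib.parse import urljoin

# Assumed scheme; placeholder domain taken from the question.
page_url = "http://www.sample.com/view-all.html"

# Three ways the same link might appear in an href attribute.
hrefs = [
    "productpage.52689.html",
    "/productpage.52689.html",
    "http://www.sample.com/productpage.52689.html",
]

resolved = [urljoin(page_url, href) for href in hrefs]
for url in resolved:
    print(url)
```

All three forms resolve to the same absolute URL, which is why the callback does not need to care how the site wrote its links.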

Can scrapy submit to an input based on id?

I have an intranet page with multiple input fields. I need Scrapy to run a search using the page's "search for products" input field, which has an id of "searchBox".
I have been able to locate the correct search box using both Scrapy and Beautiful Soup, but I am not sure how to pass that data back to Scrapy's form submission function correctly.
In Method 1 I tried to simply pass the results to Scrapy's FormRequest.from_response function as an input, but it does not work.
Method 1 - Using Scrapy to find the data
# Search for products
def parse(self, response):
    ## Let's try search using scrapy only
    sel = Selector(response)
    results = sel.xpath("//*[contains(@id, 'searchBox')]")
    for result in results:
        print(result.extract())  # Print out what scrapy found
    return scrapy.FormRequest.from_response(results, formdata={'Item': 'Whirlpool Washing Machine'})  # formdata is the data we are sending
Method 2 - Using Beautiful soup to find the data
# Search for products
def parse(self, response):
    ## Let's try search using Beautiful Soup only
    soup = BeautifulSoup(response.text, 'html.parser')
    product_search = []
    product_search.append(soup.find("input", id="searchBox"))
    print(product_search)  # Print what BS found
About the Scrapy variant:
You should yield the request, not return it.
The first argument of from_response must be the response itself, not the matched input elements that your code passes now. To pick a particular form on the page, use the formcss, formxpath, formid or formname arguments.
Try something like:
yield scrapy.FormRequest.from_response(response, formcss='form', formdata={'Item': 'Whirlpool Washing Machine'})
Just adjust the form selector in this expression. Also check what else the request needs, such as headers or cookies.
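One detail worth checking: form data is keyed by each input's name attribute, not its id, so the key passed in formdata has to be the name of the searchBox input. A small sketch with the standard library's html.parser that looks up that name. The HTML snippet and the name 'Item' are assumptions for illustration, since the real page is not shown, and InputNameFinder is a helper invented here.

```python
from html.parser import HTMLParser


class InputNameFinder(HTMLParser):
    """Record the name attribute of the <input> whose id matches target_id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.name = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and attrs.get("id") == self.target_id:
            self.name = attrs.get("name")


# Hypothetical markup standing in for the intranet page.
html = '<form action="/search"><input id="searchBox" name="Item" type="text"></form>'

finder = InputNameFinder("searchBox")
finder.feed(html)

# The input's name, not its id, becomes the formdata key.
formdata = {finder.name: "Whirlpool Washing Machine"}
print(formdata)
```

Inside a Scrapy callback the same lookup could be done with response.xpath("//input[@id='searchBox']/@name").get(), and the resulting dict passed as formdata.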

Scrapy crawler to parse data recursively can not call back

I am a newbie, and I've written a script with Python and Scrapy to get information recursively.
First, it scrapes the city links, which lead to tour information, then it follows each city link to reach its page. Next, it gets the needed tour information for that city before moving to the next page, and so on. Pagination runs on JavaScript without a visible link.
The command I used to get the result along with a CSV output is:
scrapy crawl practice -o practice.csv -t csv
The expected result is a CSV file:
title, city, price, tour_url
t1, c1, p1, url_1
t2, c2, p2, url_2
...
The problem is that the CSV file is empty. The run stops at parse_page, and callback=self.parse_item doesn't work. I don't know how to fix it. Maybe my workflow is invalid or my code has issues. Thanks for your help.
name = 'practice'
start_urls = ['https://www.klook.com/vi/search?query=VI%E1%BB%86T%20NAM%20&type=country',]
def parse(self, response):  # Extract cities from country
    hxs = HtmlXPathSelector(response)
    urls = hxs.select("//div[@class='swiper-wrapper cityData']/a/@href").extract()
    for url in urls:
        url = urllib.parse.urljoin(response.url, url)
        self.log('Found city url: %s' % url)
        yield response.follow(url, callback=self.parse_page)  # Link to city
def parse_page(self, response):  # Move to next page
    url_ = response.request.url
    yield response.follow(url_, callback=self.parse_item)
    # I will use selenium to move next page because of next button is running
    # on javascript without fixed url.
def parse_item(self, response):  # Extract tours
    for block in response.xpath("//div[@class='m_justify_list m_radius_box act_card act_card_lg a_sd_move j_activity_item js-item ']"):
        article = {}
        article['title'] = block.xpath('.//h3[@class="title"]/text()').extract()
        article['city'] = response.xpath(".//div[@class='g_v_c_mid t_mid']/h1/text()").extract()  # fixed
        article['price'] = re.sub(" +", "", block.xpath(".//span[@class='latest_price']/b/text()").extract_first()).strip()
        article['tour_url'] = 'www.klook.com' + block.xpath(".//a/@href").extract_first()
        yield article
hxs = HtmlXPathSelector(response)  # the response already supports selectors; call response.xpath directly
Instead of
url = urllib.parse.urljoin(response.url, url)
use:
url = response.urljoin(url)
And yes, the crawl stops at parse_page because the request it yields is a duplicate of the previous URL, so the duplicate filter drops it. You need to pass dont_filter=True to that request.
Instead of using Selenium, figure out what request the website performs using JavaScript (watch the Network tab of your browser's developer tools while you navigate) and reproduce a similar request.
The website uses JSON requests underneath to fetch the items, and JSON is much easier to parse than the HTML.
Also, if you are not familiar with Scrapy's asynchronous nature, you are likely to run into unexpected issues when using it in combination with Selenium.
Solutions like Splash or Selenium are only meant to be used as a last resort, when everything else fails.
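As an illustration of why a JSON endpoint is easier to work with, here is a sketch that turns a JSON payload into rows matching the expected CSV columns. The payload shape, field names and paths below are invented for the example; the real ones have to be read from the Network tab.

```python
import json

# Invented payload; the real structure must be taken from the site's actual responses.
sample = """
{"city": "c1",
 "items": [{"title": "t1", "price": " 100 000 ", "url": "/vi/activity/1"},
           {"title": "t2", "price": "250 000", "url": "/vi/activity/2"}]}
"""


def parse_tours(text, base="https://www.klook.com"):
    """Yield one dict per tour with the columns title, city, price, tour_url."""
    data = json.loads(text)
    for entry in data["items"]:
        yield {
            "title": entry["title"],
            "city": data["city"],
            "price": entry["price"].replace(" ", ""),  # drop spaces inside the price
            "tour_url": base + entry["url"],
        }


rows = list(parse_tours(sample))
print(rows)
```

In a Scrapy callback, response.text would take the place of sample, and each dict would be yielded as an item.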

Scrapy: Spider optimization

I'm trying to scrape an e-commerce website, and I'm doing it in 2 steps.
The website has a structure like this:
The homepage has the links to the family-items and subfamily-items pages
Each family and subfamily page has a paginated list of products
Right now I have 2 spiders:
GeneralSpider to get the homepage links and store them
ItemSpider to get elements from each page
I'm completely new to Scrapy and am following some tutorials to achieve this. I'm wondering how complex the parse functions can be and how rules work. My spiders currently look like this:
GeneralSpider:
class GeneralSpider(CrawlSpider):
    name = 'domain'
    allowed_domains = ['domain.org']
    start_urls = ['http://www.domain.org/home']
    def parse(self, response):
        links = LinksItem()
        links['content'] = response.xpath("//div[@id='h45F23']").extract()
        return links
ItemSpider:
class GeneralSpider(CrawlSpider):
    name = 'domain'
    allowed_domains = ['domain.org']
    f = open("urls.txt")
    start_urls = [url.strip() for url in f.readlines()]
    # Each URL in the file has pagination if it has more than 30 elements
    # I don't know how to paginate over each URL
    f.close()
    def parse(self, response):
        item = ShopItem()
        item['name'] = response.xpath("//h1[@id='u_name']").extract()
        item['description'] = response.xpath("//h3[@id='desc_item']").extract()
        item['prize'] = response.xpath("//div[@id='price_eur']").extract()
        return item
Which is the best way to make the spider follow the pagination of a URL?
If the pagination is done with jQuery, meaning there is no GET variable in the URL, would it be possible to follow the pagination?
Can I have different "rules" in the same spider to scrape different parts of the page? Or is it better to have specialized spiders, each focused on one thing?
I've also googled for any book about Scrapy, but it seems there isn't a finished one yet, or at least I couldn't find one.
Does anyone know of a Scrapy book that will be released soon?
Edit:
These 2 URLs fit this example. On the Eroski home page you can get the URLs of the product pages.
On the products page you have a paginated list of items (Eroski Items):
URL to get Links: Eroski Home
URL to get Items: Eroski Fruits
On the Eroski Fruits page, the pagination of the items seems to be jQuery/AJAX, because more items are shown when you scroll down. Is there a way to get all these items with Scrapy?
Which is the best way to make the spider follow the pagination of a URL?
This is very site-specific and depends on how the pagination is implemented.
If the pagination is done with jQuery, meaning there is no GET variable in the URL, would it be possible to follow the pagination?
This is exactly your use case: the pagination is done via additional AJAX calls that you can simulate inside your Scrapy spider.
Can I have different "rules" in the same spider to scrape different parts of the page? Or is it better to have specialized spiders, each focused on one thing?
Yes, the "rules" mechanism that a CrawlSpider provides is powerful and highly configurable. You can have multiple rules, some of which follow only links that match specific criteria or that are located in a specific section of a page. A single spider with multiple rules should be preferred over multiple spiders.
For your specific use case, here is the idea:
make a rule to follow categories and subcategories in the navigation menu of the home page; this is where restrict_xpaths helps
in the callback, for every category or subcategory, yield a Request that mimics the AJAX request sent by your browser when you open a category page
in the AJAX response handler (callback), parse the available items and yield another Request for the same category/subcategory with the page GET parameter increased (getting the next page)
Example working implementation:
import re
from urllib.parse import urlencode
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
class ProductItem(scrapy.Item):
    description = scrapy.Field()
    price = scrapy.Field()
class GrupoeroskiSpider(CrawlSpider):
    name = 'grupoeroski'
    allowed_domains = ['compraonline.grupoeroski.com']
    start_urls = ['http://www.compraonline.grupoeroski.com/supermercado/home.jsp']
    rules = [
        Rule(LinkExtractor(restrict_xpaths='//div[@class="navmenu"]'), callback='parse_categories')
    ]
    def parse_categories(self, response):
        # The numeric ids in the URL path identify category, group and family
        pattern = re.compile(r'/(\d+)\-\w+')
        groups = pattern.findall(response.url)
        params = {'page': 1, 'categoria': groups.pop(0)}
        if groups:
            params['grupo'] = groups.pop(0)
        if groups:
            params['familia'] = groups.pop(0)
        url = 'http://www.compraonline.grupoeroski.com/supermercado/ajax/listProducts.jsp?' + urlencode(params)
        yield scrapy.Request(url,
                             meta={'params': params},
                             callback=self.parse_products,
                             headers={'X-Requested-With': 'XMLHttpRequest'})
    def parse_products(self, response):
        products = response.xpath('//div[@class="product_element"]')
        if not products:
            return  # assumption: a page with no products means there are no more pages
        for product in products:
            item = ProductItem()
            item['description'] = product.xpath('.//span[@class="description_1"]/text()').extract()[0]
            item['price'] = product.xpath('.//div[@class="precio_line"]/p/text()').extract()[0]
            yield item
        # Request the next page of the same category
        params = response.meta['params']
        params['page'] += 1
        url = 'http://www.compraonline.grupoeroski.com/supermercado/ajax/listProducts.jsp?' + urlencode(params)
        yield scrapy.Request(url,
                             meta={'params': params},
                             callback=self.parse_products,
                             headers={'X-Requested-With': 'XMLHttpRequest'})
Hope this is a good starting point for you.
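The parameter-building step in parse_categories can be checked in isolation, without running the spider. A sketch using Python 3's urllib.parse: the category URL is hypothetical, and build_params is a helper name chosen here. Unlike the spider code, it returns only the page number when the URL contains no ids instead of raising.

```python
import re
from urllib.parse import urlencode

AJAX_URL = "http://www.compraonline.grupoeroski.com/supermercado/ajax/listProducts.jsp?"
ID_PATTERN = re.compile(r"/(\d+)-\w+")


def build_params(category_url, page=1):
    """Map the numeric ids in a category URL to the AJAX query parameters."""
    ids = ID_PATTERN.findall(category_url)
    params = {"page": page}
    # zip stops at the shorter sequence, so missing levels are simply skipped.
    params.update(zip(("categoria", "grupo", "familia"), ids))
    return params


# Hypothetical category URL with two levels of ids.
url = "http://www.compraonline.grupoeroski.com/supermercado/2059698-Frescos/2059699-Frutas/"
params = build_params(url)
ajax_url = AJAX_URL + urlencode(params)
print(ajax_url)
```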
Does anyone know if some Scrapy book that will be released soon?
Nothing specific that I can recall.
Though I heard that a publisher has plans to maybe release a book about web scraping, but I'm not supposed to tell you that.
