I am scraping some info off of zappos.com, specifically the part of the details page that displays what customers who viewed the current item have also viewed.
This is one such item listing:
https://www.zappos.com/p/chaco-marshall-tartan-rust/product/8982802/color/725500
The thing is, I discovered that the section I am scraping appears right away on some items, but on others it only appears after I have refreshed the page two or three times.
I am using scrapy to scrape and splash to render.
import scrapy
import re
from scrapy_splash import SplashRequest


class Scrapys(scrapy.Spider):
    name = "sqs"
    start_urls = ["https://www.zappos.com",
                  "https://www.zappos.com/marty/men-shoes/CK_XAcABAuICAgEY.zso"]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse,
                                endpoint='render.html',
                                args={'wait': 0.5},
                                )

    def parse(self, response):
        links = response.css("div._1Mgpu")
        for link in links:
            url = 'https://www.zappos.com' + link.css("a::attr(href)").extract_first()
            yield SplashRequest(url, callback=self.parse_attr,
                                endpoint='render.html',
                                args={'wait': 10},
                                )

    def parse_attr(self, response):
        alsoviewimg = response.css("div._18jp0 div._3Olkk div.QDcUX div.slider div.slider-frame ul.slider-list li.slider-slide a img").extract()
The alsoviewimg variable is one of the elements that I am pulling from the "Customers Who Viewed this Item Also Viewed" section. I have tested pulling this and other elements, all in the Scrapy shell with Splash rendering to get the dynamic content, and it pulled the content fine; however, in the spider it rarely, if ever, gets any hits.
Is there something I can set so that it loads the page a couple of times to get the content? Or is there something else that I am missing?
You should check if the element you're looking for exists. If it doesn't, load the page again.
I'd also look into why the page requires multiple refreshes; you might be able to solve the problem without this ad hoc multiple-refresh solution.
Scrapy How to check if certain class exists in a given element
That link explains how to check whether a class exists.
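A minimal sketch of that check-and-retry idea, using the question's selector and Splash setup (the retry cap, the meta counter, and the dont_filter flag are my additions, not something from the original spider):

def parse_attr(self, response):
    alsoviewimg = response.css(
        "div._18jp0 div._3Olkk div.QDcUX div.slider div.slider-frame "
        "ul.slider-list li.slider-slide a img"
    ).extract()
    if alsoviewimg:
        yield {'also_viewed': alsoviewimg}
        return
    # The section did not render this time, so request the same page again, a few times at most.
    retries = response.meta.get('retries', 0)
    if retries < 3:
        yield SplashRequest(
            response.url,
            callback=self.parse_attr,
            endpoint='render.html',
            args={'wait': 10},
            meta={'retries': retries + 1},
            dont_filter=True,  # otherwise the duplicate filter drops the repeated request
        )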
Related
Hi guys, I am very new to scraping data; I have only tried the basics. My problem is that I have two web pages on the same domain that I need to scrape.
My logic is:
First page, www.sample.com/view-all.html
*This page lists all of the items, and I need to get the href attribute of every item.
Second page, www.sample.com/productpage.52689.html
*This is a link that came from the first page, so 52689 needs to change dynamically depending on the link provided by the first page.
I need to get all of the data, like the title, description, etc., from the second page.
What I am thinking of is a for loop, but it's not working on my end. I have searched on Google, but no one has the same problem as mine. Please help me.
import scrapy


class SalesItemSpider(scrapy.Spider):
    name = 'sales_item'
    allowed_domains = ['www.sample.com']
    start_urls = ['www.sample.com/view-all.html', 'www.sample.com/productpage.00001.html']

    def parse(self, response):
        for product_item in response.css('li.product-item'):
            item = {
                'URL': product_item.css('a::attr(href)').extract_first(),
            }
            yield item
Inside parse you can yield a Request() with the URL and a callback function's name to scrape that URL in a different function:
def parse(self, response):
    for product_item in response.css('li.product-item'):
        url = product_item.css('a::attr(href)').extract_first()
        # it will send `www.sample.com/productpage.52689.html` to `parse_subpage`
        yield scrapy.Request(url=url, callback=self.parse_subpage)

def parse_subpage(self, response):
    # here you parse www.sample.com/productpage.52689.html
    item = {
        'title': ...,
        'description': ...
    }
    yield item
Look up Request in the Scrapy documentation and its tutorial.
There is also
response.follow(url, callback=self.parse_subpage)
which will automatically add www.sample.com to relative URLs, so you don't have to do it on your own as in
Request(url="www.sample.com/" + url, callback=self.parse_subpage)
See A shortcut for creating Requests.
If you are interested in scraping, then you should read docs.scrapy.org from the first page to the last one.
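As a sketch, response.follow dropped into the parse method above (assuming the same li.product-item selector from the question) would look like this:

def parse(self, response):
    for product_item in response.css('li.product-item'):
        url = product_item.css('a::attr(href)').extract_first()
        # response.follow() resolves relative URLs against the current page,
        # so a bare "/productpage.52689.html" href works without urljoin.
        yield response.follow(url, callback=self.parse_subpage)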
I am making a spider which will crawl the entire site on the first run and store the data in my database.
But I will keep running this spider on a weekly basis to get updates of the crawled site into my database, and I don't want Scrapy to crawl the pages that are already present in my database. How can I achieve this? I have made two plans:
1] Make a crawler that fetches the entire site and somehow stores the first fetched URL in a CSV file, then keeps following the next pages. Then make another crawler that fetches backwards, meaning it takes the URL from the CSV as input and keeps running until prev_page exists. This way I will get the data, but the URL in the CSV will be crawled twice.
2] Make a crawler that checks whether the data is already in the database and, if so, stops. Is that possible? This would be the most productive way, but I can't find a way to do it. Maybe making log files might help in some way?
Update
The site is a blog that updates frequently, sorted with the latest post on top.
Something like this:
from scrapy import Spider
from scrapy.http import Request, FormRequest


class MintSpiderSpider(Spider):
    name = 'Mint_spider'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/']

    def parse(self, response):
        urls = response.xpath('//div[@class="post-inner post-hover"]/h2/a/@href').extract()
        for url in urls:
            if never_visited(url, database):
                yield Request(url, callback=self.parse_lyrics)  # do you mean parse_foo?

        next_page_url = response.xpath('//li[@class="next right"]/a/@href').extract_first()
        if next_page_url:
            yield Request(next_page_url, callback=self.parse)

    def parse_foo(self, response):
        save_url(response.request.url, database)
        info = response.xpath('//*[@class="songinfo"]/p/text()').extract()
        name = response.xpath('//*[@id="lyric"]/h2/text()').extract()
        yield {
            'name': name,
            'info': info
        }
You just need to implement the never_visited and save_url functions.
never_visited will check your database to see whether the URL is already there; save_url will add the URL to your database.
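A minimal sketch of those two helpers backed by SQLite (the table name and schema here are assumptions for illustration, not part of the answer):

import sqlite3

# Hypothetical one-table schema: one row per URL that has already been crawled.
database = sqlite3.connect('crawled.db')
database.execute('CREATE TABLE IF NOT EXISTS visited (url TEXT PRIMARY KEY)')


def never_visited(url, database):
    # True if the URL has not been stored yet.
    row = database.execute('SELECT 1 FROM visited WHERE url = ?', (url,)).fetchone()
    return row is None


def save_url(url, database):
    # INSERT OR IGNORE keeps the call safe if the URL is already stored.
    database.execute('INSERT OR IGNORE INTO visited (url) VALUES (?)', (url,))
    database.commit()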
I have inspected the element in Chrome:
I want to get the data in the red box (there can be more than one) using Scrapy. I used this code (following the tutorial from the Scrapy documentation):
import scrapy


class KamusSetSpider(scrapy.Spider):
    name = "kamusset_spider"
    start_urls = ['http://kbbi.web.id/' + 'abad']

    def parse(self, response):
        for kamusset in response.css("div#d1"):
            text = kamusset.css("div.sub_17 b.tur.highlight::text").extract()
            print(dict(text=text))
But there is no result.
What happened? I have changed it to this (using Splash), but it is still not working:
def start_requests(self):
    for url in self.start_urls:
        yield SplashRequest(url, self.parse, args={'wait': 0.5})

def parse(self, response):
    html = response.body
    for kamusset in response.css("div#d1"):
        text = kamusset.css("div.sub_17 b.tur.highlight::text").extract()
        print(dict(text=text))
In this case it seems that the page content is generated dynamically: even though you can see the elements when inspecting in the browser, they are not present in the HTML source (i.e. in what Scrapy sees). That's because Scrapy can't render JavaScript. You need to use some kind of browser to render the page and then pass the result to Scrapy for processing. I recommend Splash for its seamless integration with Scrapy.
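As a sketch, a working Splash setup for the question's spider could look like the following (it assumes scrapy-splash is installed, its middlewares are enabled in settings.py together with SPLASH_URL, and a Splash instance is actually running; the wait value is a guess):

import scrapy
from scrapy_splash import SplashRequest


class KamusSetSpider(scrapy.Spider):
    name = "kamusset_spider"
    start_urls = ['http://kbbi.web.id/abad']

    def start_requests(self):
        for url in self.start_urls:
            # Render the page in Splash, giving scripts a moment to run.
            yield SplashRequest(url, self.parse, args={'wait': 2})

    def parse(self, response):
        for kamusset in response.css("div#d1"):
            text = kamusset.css("div.sub_17 b.tur.highlight::text").extract()
            yield {'text': text}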
Although I've seen several similar questions here regarding this, none seem to precisely define the process for achieving this task. I borrowed largely from the Scrapy script located here, but since it is over a year old I had to make adjustments to the XPath references.
My current code looks like this:
import scrapy
from tripadvisor.items import TripadvisorItem


class TrSpider(scrapy.Spider):
    name = 'trspider'
    start_urls = [
        'https://www.tripadvisor.com/Hotels-g29217-Island_of_Hawaii_Hawaii-Hotels.html'
    ]

    def parse(self, response):
        for href in response.xpath('//div[@class="listing_title"]/a/@href'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_hotel)

        next_page = response.xpath('//div[@class="unified pagination standard_pagination"]/child::*[2][self::a]/@href')
        if next_page:
            url = response.urljoin(next_page[0].extract())
            yield scrapy.Request(url, self.parse)

    def parse_hotel(self, response):
        for href in response.xpath('//div[starts-with(@class,"quote")]/a/@href'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_review)

        next_page = response.xpath('//div[@class="unified pagination "]/child::*[2][self::a]/@href')
        if next_page:
            url = response.urljoin(next_page[0].extract())
            yield scrapy.Request(url, self.parse_hotel)

    def parse_review(self, response):
        item = TripadvisorItem()
        item['headline'] = response.xpath('translate(//div[@class="quote"]/text(),"!"," ")').extract()[0][1:-1]
        item['review'] = response.xpath('translate(//div[@class="entry"]/p,"\n"," ")').extract()[0]
        item['bubbles'] = response.xpath('//span[contains(@class,"ui_bubble_rating")]/@alt').extract()[0]
        item['date'] = response.xpath('normalize-space(//span[contains(@class,"ratingDate")]/@content)').extract()[0]
        item['hotel'] = response.xpath('normalize-space(//span[@class="altHeadInline"]/a/text())').extract()[0]
        return item
When running the spider in its current form, I scrape the first page of reviews for each hotel listed on the start_urls page, but the pagination doesn't flip to the next page of reviews. I suspect this is because of this line:
next_page = response.xpath('//div[@class="unified pagination "]/child::*[2][self::a]/@href')
Since these pages load dynamically, there is no existing href for the next page on the current page. Investigating further, I've read that these pages send a POST request using XHR. By exploring the "Network" tab in Firefox's "Inspect" tool, I can see both a Request URL and Form Data that might be needed to flip the page, according to other posts on SO about the same topic.
However, it seems that the other posts refer to a static URL starting point when trying to pass a FormRequest using Scrapy. With TripAdvisor, the URL always changes based on the name of the hotel we're looking at, so I'm not sure how to choose a URL when using FormRequest to submit the form data: reqNum=1&changeSet=REVIEW_LIST (this form data also never seems to change from page to page).
Alternatively, there doesn't appear to be a way to extract the URL shown in the "Network" tab's "Request URL". These pages do have URLs that change from page to page, but the way TripAdvisor is set up, I cannot seem to extract them from the source code. The review pages change by incrementing the -orXX- part of the URL, where "XX" is a number. For example:
https://www.tripadvisor.com/Hotel_Review-g2312116-d113123-Reviews-Fairmont_Orchid_Hawaii-Puako_Kohala_Coast_Island_of_Hawaii_Hawaii.html
https://www.tripadvisor.com/Hotel_Review-g2312116-d113123-Reviews-or5-Fairmont_Orchid_Hawaii-Puako_Kohala_Coast_Island_of_Hawaii_Hawaii.html
https://www.tripadvisor.com/Hotel_Review-g2312116-d113123-Reviews-or10-Fairmont_Orchid_Hawaii-Puako_Kohala_Coast_Island_of_Hawaii_Hawaii.html
https://www.tripadvisor.com/Hotel_Review-g2312116-d113123-Reviews-or15-Fairmont_Orchid_Hawaii-Puako_Kohala_Coast_Island_of_Hawaii_Hawaii.html
So, my question is whether it is possible to paginate using the XHR request/form data, or whether I need to manually build a list of URLs for each hotel that adds the -orXX- increments.
Well, I ended up discovering an XPath that apparently allows pagination of the reviews, but it's funny, because every time I checked the underlying HTML, the href never changed from referring to /Hotel_Review-g2312116-d113123-Reviews-or5-Fairmont_Orchid_Hawaii-Puako_Kohala_Coast_Island_of_Hawaii_Hawaii.html, even if I was on page 10, for example. It seems the "-orXX-" part of the link always increments XX by 5, so I'm not sure why this works.
All I did was change the line:
next_page = response.xpath('//div[@class="unified pagination "]/child::*[2][self::a]/@href')
to:
next_page = response.xpath('//link[#rel="next"]/#href')
and ended up with >41K extracted reviews. I would love to get others' opinions on handling this problem in other situations.
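For reference, here is how that one-line change sits in the question's parse_hotel method (same structure as above; only the next_page XPath differs):

def parse_hotel(self, response):
    for href in response.xpath('//div[starts-with(@class,"quote")]/a/@href'):
        url = response.urljoin(href.extract())
        yield scrapy.Request(url, callback=self.parse_review)

    # The <link rel="next"> element in the page head carries the URL of the next
    # review page, so pagination no longer depends on dynamically rendered buttons.
    next_page = response.xpath('//link[@rel="next"]/@href')
    if next_page:
        url = response.urljoin(next_page[0].extract())
        yield scrapy.Request(url, self.parse_hotel)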
I'm writing a Scrapy application that crawls a website's main page, saves its URL, and also checks its menu items, so that it applies the same process recursively to them.
class NeatSpider(scrapy.Spider):
    name = "example"
    start_urls = ['https://example.com/main/']

    def parse(self, response):
        url = response.url
        yield url

        # check for in-menu article links
        menu_items = response.css(MENU_BAR_LINK_ITEM).extract()
        if menu_items is not None:
            for menu_item in menu_items:
                yield scrapy.Request(response.urljoin(menu_item), callback=self.parse)
On the example website, each menu item leads to another page with more menu items.
Some page responses reach the 'parse' method, so their URLs get saved, while others do not.
The ones that don't still give back a 200 status (when I enter their address manually in the browser), don't throw any exceptions, and show pretty much the same behavior as the pages that do reach the parse method.
Additional information: ALL of the menu items reach the last line of the code (without any errors), and if I provide an 'errback' callback method, no request ever ends up there.
EDIT: here is the log: http://pastebin.com/2j5HMkqN
There is a chance that the website you are scraping is showing captchas.
You can debug your scraper like this; it will open the scraped web page in your OS's default browser.
from scrapy.utils.response import open_in_browser

def parse_details(self, response):
    # response.text is the decoded page body (use it instead of response.body,
    # which is bytes in Python 3).
    if "item name" not in response.text:
        open_in_browser(response)
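If the opened page does turn out to be a captcha, one rough way to handle it is to re-schedule the blocked request; a sketch follows (the "captcha" marker string and the retry cap are assumptions, so adjust them to whatever the blocked page actually contains):

def parse_details(self, response):
    if "captcha" in response.text.lower():
        # Assumed marker text; re-request the page a limited number of times.
        retries = response.meta.get('captcha_retries', 0)
        if retries < 2:
            yield response.request.replace(
                dont_filter=True,  # let the duplicate filter pass the same URL again
                meta={'captcha_retries': retries + 1},
            )
        return
    # ...normal item extraction goes here...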