I am trying to scrape reviews and their star ratings for various products from Snapdeal. I am accessing each product through its product URL. On the product page, I want to filter the ratings by star count and fetch both the rating value and the review text. I am using the following code to do so:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import NoSuchElementException
import time

# driver is assumed to have been created earlier, e.g. driver = webdriver.Chrome()

url_snapdeal = 'https://www.snapdeal.com/'
driver.get(url_snapdeal)
time.sleep(2)

# search for smartphones
search = driver.find_element_by_id('inputValEnter')
search.clear()
search.send_keys('smartphone')
search.send_keys(Keys.ENTER)
time.sleep(2)

# scroll a few times to load more search results
for i in range(0, 3):
    driver.execute_script('window.scrollTo(0,document.body.scrollHeight)')
    time.sleep(1)

# collect the product URLs from the search results
urls = []
for link in driver.find_elements_by_xpath("//div[@class='product-desc-rating ']/a"):
    urls.append(link.get_attribute('href'))

snap_reviews = []
snap_ratings = []

for url in urls:
    driver.get(url)
    time.sleep(4)
    try:
        # cycle through the star-filter options in the dropdown
        for x in range(2, 7):
            driver.find_element_by_xpath("//div[@class='selectarea']").click()
            time.sleep(1)
            driver.find_element_by_xpath(f"//div[@class='options']/ul/li[{x}]").click()
            time.sleep(1)
            for rating in driver.find_elements_by_xpath("//div[@class='user-review']/div[1]"):
                stars = rating.find_elements_by_xpath("i[@class='sd-icon sd-icon-star active']")
                snap_ratings.append(len(stars))
    except NoSuchElementException:
        print('Not found')
The try block is supposed to click the star-filter dropdown and select 5 stars, collect the star ratings and review text, then click the dropdown again, select 4 stars, collect those ratings and reviews, and so on.
My code manages to click the dropdown but is unable to click the filter options such as 5 stars, 4 stars, etc. It throws an ElementNotInteractableException.
Any help or suggestion would be greatly appreciated. Thanks in advance.
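One common fix for ElementNotInteractableException is to wait explicitly until the option is actually clickable (and scroll it into view if needed) before clicking. A minimal sketch of that idea, meant to replace the two clicks inside the loop above; the XPaths are copied from the question and are otherwise unverified:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(driver, 10)

# open the star-filter dropdown once it is clickable
wait.until(EC.element_to_be_clickable((By.XPATH, "//div[@class='selectarea']"))).click()

# wait for the x-th filter option (5 star, 4 star, ...) to become clickable, then click it
option = wait.until(EC.element_to_be_clickable((By.XPATH, f"//div[@class='options']/ul/li[{x}]")))
driver.execute_script("arguments[0].scrollIntoView(true);", option)
option.click()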
You can actually get the ratings directly by product number. So get the product numbers and feed those in (I haven't looked, but it might be possible to get those without Selenium as well). Then you can simply filter the dataframe. Here's an example for one product:
import requests
import pandas as pd
import math
productId = 639365186960
url = 'https://www.snapdeal.com/acors/web/getSelfieList/v2'
payload = {
    'productId': productId,
    'offset': 0}

jsonData = requests.get(url, params=payload).json()
total_pages = math.ceil(jsonData['selfieTotal'] / 10)

for page in range(1, total_pages + 1):
    if page == 1:
        ratings = jsonData['selfieList']
    else:
        payload['offset'] = 10 * (page - 1)
        jsonData = requests.get(url, params=payload).json()
        ratings += jsonData['selfieList']

df = pd.DataFrame(ratings)
Output:
df[df['rating'] == 4]
Out[82]:
selfieId ... reducedImage
0 015cd9a6a1e80000dd22850445ac4f71 ... https://n1.sdlcdn.com/image/upload/h_162,w_162...
1 015bffb8bcb60000dd22850466cc1c72 ... https://n1.sdlcdn.com/image/upload/h_162,w_162...
6 015b4dd88df70000dd228504c9271488 ... https://n1.sdlcdn.com/image/upload/h_162,w_162...
7 015b4dd7f9b80000dd228504b8777fc8 ... https://n1.sdlcdn.com/image/upload/h_162,w_162...
8 015b1e574cdc0000dd228504d4a694ad ... https://n1.sdlcdn.com/image/upload/h_162,w_162...
9 015b182be5640000dd22850418c0bdd6 ... https://n1.sdlcdn.com/image/upload/h_162,w_162...
10 015aa6e8700e0000dd228504a3378958 ... https://n1.sdlcdn.com/image/upload/h_162,w_162...
11 015a9df9ab640000dd2285045069dcff ... https://n1.sdlcdn.com/image/upload/h_162,w_162...
14 015a4c7a37040000dd2285045daaa6d3 ... https://n1.sdlcdn.com/image/upload/h_162,w_162...
15 015a4b377b8b0000dd228504d3dd159b ... https://n1.sdlcdn.com/image/upload/h_162,w_162...
[10 rows x 9 columns]
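To scale this to the products collected earlier, a rough sketch is below. It assumes you have already built a list of product IDs (called product_ids here, which is a name I'm introducing for illustration); the endpoint and field names are taken from the single-product example above:

import math
import requests
import pandas as pd

def get_ratings(productId):
    # fetch every page of ratings for one product from the selfie-list endpoint
    url = 'https://www.snapdeal.com/acors/web/getSelfieList/v2'
    payload = {'productId': productId, 'offset': 0}
    jsonData = requests.get(url, params=payload).json()
    ratings = jsonData['selfieList']
    total_pages = math.ceil(jsonData['selfieTotal'] / 10)
    for page in range(2, total_pages + 1):
        payload['offset'] = 10 * (page - 1)
        ratings += requests.get(url, params=payload).json()['selfieList']
    return ratings

# product_ids is assumed to be the list of product numbers you extracted from the search results
all_ratings = []
for pid in product_ids:
    all_ratings.extend(get_ratings(pid))

df = pd.DataFrame(all_ratings)
five_star = df[df['rating'] == 5]   # filter exactly as in the single-product example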
import urllib.request
from bs4 import BeautifulSoup

rating = []
for i in range(0, 10):
    url = "https://www.yelp.com/biz/snow-show-flushing?osq=ice%20cream%20shop&start=" + str(10*i)
    ourUrl = urllib.request.urlopen(url)
    soup = BeautifulSoup(ourUrl, 'html.parser')
    for r in soup.find_all('span', {'class': "display--inline__373c0__1gaV4 border-color--default__373c0__1yxBb"})[1:]:
        per_rating = r.div.get('aria-label')
        rating.append(per_rating)
This tries to get the ratings from each page. There should be only 58 ratings in total, but the result also includes the ratings from the "You Might Also Consider" section.
How can I fix this?
One possible solution would be to retrieve the total number of reviews from Yelp using BeautifulSoup. You can then trim your "rating" list by the number of reviews.
import re

# find the total number of reviews:
regex_count = re.compile('.*css-foyide.*')
Review_count = soup.find_all("p", {"class": regex_count})
Review_count = Review_count[0].text
Review_count = int(Review_count.split()[0])  # total number of reviews
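To apply the trim this describes, a small sketch (it assumes the extra "You Might Also Consider" ratings are the entries beyond the real review count):

# keep only the first Review_count ratings and drop the extras
rating = rating[:Review_count]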
I am trying to extract the star rating of each review into a dataframe for sentiment analysis.
https://www.mouthshut.com/product-reviews/Kotak-811-Mobile-Banking-reviews-925917218
This is the webpage I am trying to scrape. I am fairly new to web scraping, so I prefer BeautifulSoup as it is easier to understand.
import requests
from bs4 import BeautifulSoup
import pandas as pd
URL = ""
Final = []
for x in range(0, 8):
if x == 1:
URL = "https://www.mouthshut.com/product-reviews/Kotak-811-Mobile-Banking-reviews-925917218"
else:
URL ="https://www.mouthshut.com/product-reviews/Kotak-811-Mobile-Banking-reviews-925917218-page-{}".format(x)
r = requests.get(URL)
soup = BeautifulSoup(r.content, 'html.parser')
reviews = [] # a list to store reviews
# Use a CSS selector to extract all the review containers
review_divs = soup.select('div.col-10.review')
for element in review_divs :
review = {'Review_Title': element .a.text, 'URL': element .a['href'], 'Review': element .find('div', {'class': ['more', 'reviewdata']}).text.strip()}
reviews.append(review)
Final.extend(reviews)
df = pd.DataFrame(Final)
I would really appreciate the help. Thank you.
You may add the following entry to your review dictionary to count the stars given, which appear under class="rating".
'Stars' : len(element.find('div', "rating").findAll("i", "rated-star"))
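For context, the dictionary built inside the question's loop might then look like this (a sketch; the class names come from the question and the answer above and are otherwise unverified):

review = {'Review_Title': element.a.text,
          'URL': element.a['href'],
          'Review': element.find('div', {'class': ['more', 'reviewdata']}).text.strip(),
          'Stars': len(element.find('div', "rating").findAll("i", "rated-star"))}
reviews.append(review)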
Review_Title ... Stars
0 Why need permission for contact, gallery ... 1
1 Very dull marketing for open account ... 1
2 Worst bank ... 1
3 Good interface & can be easily accessible ... 3
4 Best digital Bank account ... 4
5 Better account for everyone ... 4
6 Feature full Mobile banking ... 5
7 Very good bank ... 4
8 Above average online banking experience ... 3
...
I'm currently using this code to scrape reviews from TrustPilot. I wish to adjust it to scrape reviews from https://boxes.mysubscriptionaddiction.com/box/boxycharm?ratings=true#review-update-create. However, unlike most other review sites, the reviews are not split across multiple sub-pages; instead there is a "view more reviews" button at the end of the page that reveals three additional reviews each time you press it.
Is it possible to adjust the code so that it can scrape all the reviews for this particular product, given this kind of page structure?
from bs4 import BeautifulSoup
import requests
import pandas as pd
import json
print('all imported successfully')

# Initialize an empty dataframe
df = pd.DataFrame()
for x in range(1, 44):
    names = []
    headers = []
    bodies = []
    ratings = []
    published = []
    updated = []
    reported = []

    link = f'https://www.trustpilot.com/review/birchbox.com?page={x}'
    print(link)
    req = requests.get(link)
    content = req.content
    soup = BeautifulSoup(content, "lxml")
    articles = soup.find_all('article', {'class': 'review'})

    for article in articles:
        names.append(article.find('div', attrs={'class': 'consumer-information__name'}).text.strip())
        headers.append(article.find('h2', attrs={'class': 'review-content__title'}).text.strip())
        try:
            bodies.append(article.find('p', attrs={'class': 'review-content__text'}).text.strip())
        except:
            bodies.append('')
        try:
            #ratings.append(article.find('div', attrs={'class': 'star-rating star-rating--medium'}).text.strip())
            #ratings.append(article.find('div', attrs={'class': 'star-rating star-rating--medium'})['alt'])
            ratings.append(article.find_all("img", alt=True)[0]["alt"])
        except:
            ratings.append('')
        dateElements = article.find('div', attrs={'class': 'review-content-header__dates'}).text.strip()
        jsonData = json.loads(dateElements)
        published.append(jsonData['publishedDate'])
        updated.append(jsonData['updatedDate'])
        reported.append(jsonData['reportedDate'])

    # Create your temporary dataframe of the first iteration, then append that into your "final" dataframe
    temp_df = pd.DataFrame({'User Name': names, 'Header': headers, 'Body': bodies, 'Rating': ratings, 'Published Date': published, 'Updated Date': updated, 'Reported Date': reported})
    df = df.append(temp_df, sort=False).reset_index(drop=True)
    print('pass1')

df.to_csv('BirchboxReviews2.0.csv', index=False, encoding='utf-8')
print('excel done')
Basically, you are dealing with a website that is dynamically loaded via JavaScript once the page loads, where the comments are rendered with JS on each scroll down.
I've been able to locate the XHR request that fetches the comments, call it directly, and retrieve all the comments you asked for.
You don't need to use Selenium, as it will only slow down your task.
Here is how you can achieve your target: assuming each response page includes 3 comments, we just do the maths to cover the full set of pages.
import requests
from bs4 import BeautifulSoup
import math
def PageNum():
    r = requests.get(
        "https://boxes.mysubscriptionaddiction.com/box/boxycharm?ratings=true#review-update-create")
    soup = BeautifulSoup(r.text, 'html.parser')
    # read the total number of reviews from the "show more reviews" link text
    num = int(
        soup.find("a", class_="show-more-reviews").text.split(" ")[3][1:-1])
    if num % 3 == 0:
        return (num // 3) + 1
    else:
        return math.ceil(num / 3) + 2


def Main():
    num = PageNum()
    headers = {
        'X-Requested-With': 'XMLHttpRequest'
    }
    with requests.Session() as req:
        for item in range(1, num):
            print(f"Extracting Page# {item}")
            r = req.get(
                f"https://boxes.mysubscriptionaddiction.com/get_user_reviews?box_id=105&page={item}", headers=headers)
            soup = BeautifulSoup(r.text, 'html.parser')
            for com in soup.findAll("div", class_=r'\"comment-body\"'):
                print(com.text[5:com.text.find(r"\n", 3)])


Main()
A sample of the output:
Number of Pages 49
Extracting Page# 1
****************************************
I think Boxycharm overall is the best beauty subscription. However, I think it's
ridiculous that if you want to upgrade you have to pay the 25 for the first box and then add additional money to get the premium. Even though it's only one time,
that's insane. So about 80 bucks just to switch to Premium. And suppose U do that and then my Boxy Premium shows up at my door. I open it ....and absolutely hate
the majority if everything I have. Yeah I would be furious! Not worth taking a chance on. Boxy only shows up half the time with actual products or colors I use.
I love getting the monthly boxes, just wish they would have followed my preferences for colors!
I used to really get excited for my boxes. But not so much anymore. This months
Fenty box choices lack! I am not a clown
Extracting Page# 2
****************************************
Love it its awsome
Boxycharm has always been a favorite subscription box, I’ve had it off and on , love most of the goodies. I get frustrated when they don’t curate it to fit me and or customer service isn’t that helpful but overall a great box’!
I like BoxyCharm but to be honest I feel like some months they don’t even look at your beauty profile because I sometimes get things I clearly said I wasn’t interested in getting.
Extracting Page# 3
****************************************
The BEST sub box hands down.
I love all the boxy charm boxes everything is amazing all full size products and
the colors are outstanding
I absolutely love Boxycharm. I have received amazing high end products. My makeup cart is so full I have such a variety everyday. I love the new premium box and paired with Boxyluxe I recieve 15 products for $85 The products are worth anywhere from $500 to $700 total. I used to spend $400 a month buying products at Ulta. I would HIGHLY recommend this subscription.
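If you'd rather collect the comments than print them, a minimal variation of Main (reusing PageNum, the endpoint, and the selectors from the code above, which are otherwise unverified) could store them for a DataFrame:

import pandas as pd

def MainToList():
    comments = []
    num = PageNum()
    headers = {'X-Requested-With': 'XMLHttpRequest'}
    with requests.Session() as req:
        for item in range(1, num):
            r = req.get(
                f"https://boxes.mysubscriptionaddiction.com/get_user_reviews?box_id=105&page={item}",
                headers=headers)
            soup = BeautifulSoup(r.text, 'html.parser')
            for com in soup.findAll("div", class_=r'\"comment-body\"'):
                # same text slicing as the print version, just stored instead of printed
                comments.append(com.text[5:com.text.find(r"\n", 3)])
    return comments

df = pd.DataFrame({'Body': MainToList()})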
Also, I have worked out the code for your website. It uses Selenium for the button clicks and scrolling; do let me know if you have any doubts. I still suggest you go through the article first:
# -*- coding: utf-8 -*-
"""
Created on Sun Mar 8 18:09:45 2020
@author: prakharJ
"""
from selenium import webdriver
import time
import pandas as pd

names_found = []
comments_found = []
ratings_found = []
dateElements_found = []

# Web extraction of web page boxes
print("scheduled to run boxes web scraper")
driver = webdriver.Chrome(executable_path='Your/path/to/chromedriver.exe')
webpage = 'https://boxes.mysubscriptionaddiction.com/box/boxycharm?ratings=true#review-update-create'
driver.get(webpage)

SCROLL_PAUSE_TIME = 6

# Get scroll height
last_height = driver.execute_script("return document.body.scrollHeight")

while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight*0.80);")
    time.sleep(SCROLL_PAUSE_TIME)
    try:
        b = driver.find_element_by_class_name('show-more-reviews')
        b.click()
        time.sleep(SCROLL_PAUSE_TIME)
    except Exception:
        s = 'no button'
    # Calculate new scroll height and compare with last scroll height
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

names_list = driver.find_elements_by_class_name('name')
comment_list = driver.find_elements_by_class_name('comment-body')
rating_list = driver.find_elements_by_xpath("//meta[@itemprop='ratingValue']")
date_list = driver.find_elements_by_class_name('comment-date')

for names in names_list:
    names_found.append(names.text)
for bodies in comment_list:
    try:
        comments_found.append(bodies.text)
    except:
        comments_found.append('NA')
for ratings in rating_list:
    try:
        ratings_found.append(ratings.get_attribute("content"))
    except:
        ratings_found.append('NA')
for dateElements in date_list:
    dateElements_found.append(dateElements.text)

# Create your temporary dataframe of the first iteration, then append that into your "final" dataframe
temp_df = pd.DataFrame({'User Name': names_found, 'Body': comments_found, 'Rating': ratings_found, 'Published Date': dateElements_found})
#df = df.append(temp_df, sort=False).reset_index(drop=True)
print('extraction completed for the day and system goes into sleep mode')
driver.quit()
I'm using the following Python script to scrape info from Amazon pages.
At some point, it stopped returning page results. The script starts and browses through the keywords/pages, but I only get the headers as output:
Keyword Rank Title ASIN Score Reviews Prime Date
I suspect the problem is in the following line, as this tag doesn't exist anymore and the results variable never gets a value:
results = soup.findAll('div', attrs={'class': 's-item-container'})
This is the full code:
from bs4 import BeautifulSoup
import time
from selenium import webdriver
import re
import datetime
from collections import deque
import logging
import csv
class AmazonScaper(object):

    def __init__(self, keywords, output_file='example.csv', sleep=2):
        self.browser = webdriver.Chrome(executable_path='/Users/willcecil/Dropbox/Python/chromedriver')  # Add path to your Chromedriver
        self.keyword_queue = deque(keywords)  # Add the start URL to our list of URLs to crawl
        self.output_file = output_file
        self.sleep = sleep
        self.results = []

    def get_page(self, keyword):
        try:
            self.browser.get('https://www.amazon.co.uk/s/ref=nb_sb_noss_2?url=search-alias%3Daps&field-keywords={a}'.format(a=keyword))
            return self.browser.page_source
        except Exception as e:
            logging.exception(e)
            return

    def get_soup(self, html):
        if html is not None:
            soup = BeautifulSoup(html, 'lxml')
            return soup
        else:
            return

    def get_data(self, soup, keyword):
        try:
            results = soup.findAll('div', attrs={'class': 's-item-container'})
            for a, b in enumerate(results):
                soup = b
                header = soup.find('h2')
                result = a + 1
                title = header.text
                try:
                    link = soup.find('a', attrs={'class': 'a-link-normal a-text-normal'})
                    url = link['href']
                    url = re.sub(r'/ref=.*', '', str(url))
                except:
                    url = "None"
                # Extract the ASIN from the URL - ASIN is the breaking point to filter out if the position is sponsored
                ASIN = re.sub(r'.*amazon.co.uk.*/dp/', '', str(url))
                # Extract Score Data using ASIN number to find the span class
                score = soup.find('span', attrs={'name': ASIN})
                try:
                    score = score.text
                    score = score.strip('\n')
                    score = re.sub(r' .*', '', str(score))
                except:
                    score = "None"
                # Extract Number of Reviews in the same way
                reviews = soup.find('a', href=re.compile(r'.*#customerReviews'))
                try:
                    reviews = reviews.text
                except:
                    reviews = "None"
                # And again for Prime
                PRIME = soup.find('i', attrs={'aria-label': 'Prime'})
                try:
                    PRIME = PRIME.text
                except:
                    PRIME = "None"
                data = {keyword: [keyword, str(result), title, ASIN, score, reviews, PRIME, datetime.datetime.today().strftime("%B %d, %Y")]}
                self.results.append(data)
        except Exception as e:
            print(e)
        return 1

    def csv_output(self):
        keys = ['Keyword', 'Rank', 'Title', 'ASIN', 'Score', 'Reviews', 'Prime', 'Date']
        print(self.results)
        with open(self.output_file, 'a', encoding='utf-8') as outputfile:
            dict_writer = csv.DictWriter(outputfile, keys)
            dict_writer.writeheader()
            for item in self.results:
                for key, value in item.items():
                    print(".".join(value))
                    outputfile.write(",".join('"' + item + '"' for item in value) + "\n")  # Add "" quote character so the CSV accepts commas

    def run_crawler(self):
        while len(self.keyword_queue):  # If we have keywords to check
            keyword = self.keyword_queue.popleft()  # We grab a keyword from the left of the list
            html = self.get_page(keyword)
            soup = self.get_soup(html)
            time.sleep(self.sleep)  # Wait for the specified time
            if soup is not None:  # If we have soup - parse and save data
                self.get_data(soup, keyword)
        self.browser.quit()
        self.csv_output()  # Save the object data to csv


if __name__ == "__main__":
    keywords = [str.replace(line.rstrip('\n'), ' ', '+') for line in open('keywords.txt')]  # Use our file of keywords & replace spaces with +
    ranker = AmazonScaper(keywords)  # Create the object
    ranker.run_crawler()  # Run the rank checker
The output should look like this (I have trimmed the Titles for clarity).
Keyword          Rank  Title                   ASIN        Score  Reviews  Prime  Date
Blue+Skateboard  3     Osprey Complete Beginn  B00IL1JMF4  3.7    40       Prime  February 21, 2019
Blue+Skateboard  4     ENKEEO Complete Mini C  B078J9Y1DG  4.5    42       Prime  February 21, 2019
Blue+Skateboard  5     skatro - Mini Cruiser   B00K93PIXM  4.8    223      Prime  February 21, 2019
Blue+Skateboard  7     Vinsani Retro Cruiser   B00CSV72AK  4.4    8        Prime  February 21, 2019
Blue+Skateboard  8     Ridge Retro Cruiser Bo  B00CA33ISQ  4.1    207      Prime  February 21, 2019
Blue+Skateboard  9     Xootz Kids Complete Be  B01B2YNSJM  3.6    32       Prime  February 21, 2019
Blue+Skateboard  10    Enuff Pyro II Skateboa  B00MGRGX2Y  4.3    68       Prime  February 21, 2019
The following shows some changes you could make. I have switched to using CSS selectors in some places.
The main result set to loop over is retrieved by soup.select('.s-result-list [data-asin]'). This selects elements that have a data-asin attribute and sit inside an element with the class s-result-list. It matches the 60 (current) items on the page.
I swapped the PRIME selection to an attribute = value selector: soup.select_one('[aria-label="Amazon Prime"]').
Headers are now h5, i.e. header = soup.select_one('h5').
Example code:
import datetime
from bs4 import BeautifulSoup
import time
from selenium import webdriver
import re
keyword = 'blue+skateboard'
driver = webdriver.Chrome()
url = 'https://www.amazon.co.uk/s/ref=nb_sb_noss_2?url=search-alias%3Daps&field-keywords={}'
driver.get(url.format(keyword))
soup = BeautifulSoup(driver.page_source, 'lxml')
results = soup.select('.s-result-list [data-asin]')
for a, b in enumerate(results):
    soup = b
    header = soup.select_one('h5')
    result = a + 1
    title = header.text.strip()

    try:
        link = soup.select_one('h5 > a')
        url = link['href']
        url = re.sub(r'/ref=.*', '', str(url))
    except:
        url = "None"

    if url != '/gp/slredirect/picassoRedirect.html':
        ASIN = re.sub(r'.*/dp/', '', str(url))
        #print(ASIN)

        try:
            score = soup.select_one('.a-icon-alt')
            score = score.text
            score = score.strip('\n')
            score = re.sub(r' .*', '', str(score))
        except:
            score = "None"

        try:
            reviews = soup.select_one("[href*='#customerReviews']")
            reviews = reviews.text.strip()
        except:
            reviews = "None"

        try:
            PRIME = soup.select_one('[aria-label="Amazon Prime"]')
            PRIME = PRIME['aria-label']
        except:
            PRIME = "None"

        data = {keyword: [keyword, str(result), title, ASIN, score, reviews, PRIME, datetime.datetime.today().strftime("%B %d, %Y")]}
        print(data)
Example output:
When I try to execute the program, I keep getting an IndexError: list index out of range. Here is my code:
''' This program accesses the Bloomberg US Stock information page.
It uses BeautifulSoup to parse the html and then finds the elements with the top 20 stocks.
It finds the stock name, value, net change, and percent change.
'''
import urllib
from urllib import request
from bs4 import BeautifulSoup
# get the bloomberg stock page
bloomberg_url = "http://www.bloomberg.com/markets/stocks/world-indexes/americas"
try:
    response = request.urlopen(bloomberg_url)
except urllib.error.URLError as e:
    if hasattr(e, 'reason'):
        print('We failed to reach a server.')
        print('Reason: ', e.reason)
    elif hasattr(e, 'code'):
        print('The server couldn\'t fulfill the request.')
        print('Error code: ', e.code)
else:
    # the url request was successful
    html = response.read().decode('utf8')

    # use the BeautifulSoup parser to create the beautiful soup object with html structure
    bloomsoup = BeautifulSoup(html, 'html.parser')
    pagetitle = bloomsoup.title.get_text()

    # the 20 stocks are stored in the first 10 "oddrow" tr tags
    # and the 10 "evenrow" tr tags
    oddrows = bloomsoup.find_all("tr", class_="oddrow")
    evenrows = bloomsoup.find_all("tr", class_="evenrow")

    # alternate odd and even rows to put all 20 rows together
    allrows = []
    for i in range(12):
        allrows.append(oddrows[i])
        allrows.append(evenrows[i])
    allrows.append(oddrows[12])

    # iterate over the BeautifulSoup tr tag objects and get the team items into a dictionary
    stocklist = []
    for item in allrows:
        stockdict = {}
        stockdict['stockname'] = item.find_all('a')[1].get_text()
        stockdict['value'] = item.find("td", class_="pr-rank").get_text()
        stockdict['net'] = item.find('span', class_="pr-net").get_text()
        stockdict['%'] = item.find('td', align="center").get_text()
        stocklist.append(stockdict)

    # print the title of the page
    print(pagetitle, '\n')

    # print out all the teams
    for stock in stocklist:
        print('Name:', stock['stockname'], 'Value:', stock['value'], 'Net Change:', stock['net'],
              'Percent Change:', stock['%'])
oddrows and evenrows only have 10 elements each, according to your own comment:
the 20 stocks are stored in the first 10 "oddrow" tr tags and the 10 "evenrow" tr tags
But you loop 12 times instead of 10: for i in range(12):
Change 12 to 10 and it should work.
Side note: I don't suggest hardcoding that value.
You could replace
allrows = []
for i in range(12):
    allrows.append(oddrows[i])
    allrows.append(evenrows[i])
with
allrows = []
for x, y in zip(oddrows, evenrows):
    allrows.append(x)
    allrows.append(y)
Alternatively, replace the allrows looping with the following; the original index-based version is a little bit weird and very much prone to bugs.
import itertools
allrows = [i for z in itertools.zip_longest(oddrows, evenrows) for i in z if i]
If you don't want to have indexing bugs/problems, just eliminate them. Go more functional.