Scraping hidden product details on a webpage using Selenium


Sorry, I am a Selenium noob and have done a lot of reading, but I am still having trouble getting the product price (£0.55) from this page:
https://groceries.asda.com/product/spaghetti-tagliatelle/asda-spaghetti/36628. The product details are not visible when parsing the HTML with bs4. Using Selenium I can get the entire page source as a string and can see the price in it (using the following code). I could extract the price from that string somehow, but I would prefer a less hacky solution.
browser = webdriver.Firefox(executable_path=r'C:\Users\Paul\geckodriver.exe')
browser.get('https://groceries.asda.com/product/tinned-tomatoes/asda-smart-price-chopped-tomatoes-in-tomato-juice/19560')
content = browser.page_source
If I run something like this:
elem = browser.find_element_by_id("bodyContainerTemplate")
print(elem)
It just returns: selenium.webdriver.firefox.webelement.FirefoxWebElement (session="df23fae6-e99c-403c-a992-a1adf1cb8010", element="6d9aac0b-2e98-4bb5-b8af-fcbe443af906")
The price is the text of this element: <p class="prod-price">, but I cannot get this working. How should I go about getting this text (the product price)?

elem is a WebElement, so printing it shows the object itself rather than its contents. To extract the text of a web element, read its text attribute:
elem = driver.find_element_by_class_name("prod-price-inner")
print(elem.text)
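Note that the find_element_by_* helpers were removed in Selenium 4.3; current releases take a By locator instead. Below is a minimal sketch of the same idea with an explicit wait. The function names fetch_price_text and parse_price are made up for illustration, and the p.prod-price selector is taken from the question, so it may not match the live page any more.

```python
import re
from decimal import Decimal


def fetch_price_text(driver, url, timeout=10):
    """Load url and return the visible text of the price element."""
    # Imported here so parse_price below can be used without Selenium installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver.get(url)
    # Wait until the JavaScript-rendered element is visible on the page.
    # "p.prod-price" comes from the question and is an assumption about the page.
    elem = WebDriverWait(driver, timeout).until(
        EC.visibility_of_element_located((By.CSS_SELECTOR, "p.prod-price"))
    )
    return elem.text


def parse_price(text):
    """Extract the first decimal number from a string such as '£0.55'."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        raise ValueError(f"no price found in {text!r}")
    return Decimal(match.group())
```

With a running browser this could be called as parse_price(fetch_price_text(webdriver.Firefox(), url)). The explicit wait replaces guessing how long the page needs to render.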

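Try this solution; it works with Selenium and BeautifulSoup:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
url='https://groceries.asda.com/product/spaghetti-tagliatelle/asda-spaghetti/36628'
# PhantomJS support was removed in Selenium 4, so headless Firefox is used instead
options = Options()
options.add_argument('-headless')
driver = webdriver.Firefox(options=options)
driver.get(url)
data = driver.page_source
soup = BeautifulSoup(data, 'html.parser')
# the price sits inside the span with class "prod-price-inner"
ele = soup.find('span',{'class':'prod-price-inner'})
print(ele.text)
driver.quit()
It will print: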
£0.55

Related

Python (with selenium) how to modify and activate elements to update webpage before scraping data

I am trying to scrape some data from https://marvelsnapzone.com/decks/, but I would like to modify the table of decks before scraping it. For example:
Adding card names:
I am trying to add new entries to the div with id="tagsblock", with class="tag card" and a card name such as "Angela".
Executing the "Search":
I would then like to trigger the id="searchdecks" element to update the table of decks.
Sorting by ascending "Likes":
Lastly, I want to change the span with data-sorttype="likes" so that it has class="asc".
Below is my current Python script, which does not seem to sort by "Likes" before scraping the deck info. It also does not yet add cards or execute the "Search".
import re
import requests
import os
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
def scrap():
    url = 'https://marvelsnapzone.com/decks'
    chrome_options = Options()
    chrome_options.headless = True
    chrome_options.add_argument('--disable-dev-shm-usage')
    chrome_options.add_argument('--disable-extensions')
    chrome_options.add_argument('--disable-gpu')
    browser = webdriver.Chrome(options=chrome_options)
    browser.get(url)
    html = browser.page_source
    soup = BeautifulSoup(html, 'html.parser')
    # I would like to add cards and execute the "Search" option here
    selects = soup.findAll('span', {'data-sorttype': 'likes'})
    for select in selects:
        browser.execute_script("arguments[0].setAttribute('class', 'asc')", select)
    # this does not seem to sort the table, this is based on the data scraped later
    links = soup.findAll('a', {'class': 'card cardtooltip maindeckcard tooltiploaded'})
    # ... more web-scraping code ...
    # I am able to scrape the information after this, but I am struggling to modify the table
    # before scraping the information.
if __name__ == '__main__':
    characters = scrap()

Sites like this are usually dynamic: they load or re-sort data through a script that runs when you click a button. In such cases, setting an attribute with Selenium only changes the markup; it does not run that script, so the table will not change.
That said, your code has some errors, which I think come from assuming that Selenium and BeautifulSoup talk to each other (i.e. interact).
By doing this
soup = BeautifulSoup(...)
browser.execute_script(...)
links = soup.findAll(...)
you are trying to "update" soup by executing a script, but it doesn't work like that. soup is a parsed copy of the HTML string you got from browser.page_source, and it is not connected to the live page. So when you run soup.findAll(...) you are using an "old" soup which doesn't contain the modifications made by browser.execute_script(...).
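To see why an already parsed soup goes stale, here is a small sketch that needs no browser. The two HTML strings are invented stand-ins for browser.page_source before and after a script runs; they are not taken from the real site.

```python
from bs4 import BeautifulSoup

# Hypothetical page source before and after a script changes the page.
html_before = '<span data-sorttype="likes">Likes</span>'
html_after = '<span data-sorttype="likes" class="asc">Likes</span>'

# Parsing copies the string into a separate tree.
soup = BeautifulSoup(html_before, "html.parser")
span = soup.find("span", {"data-sorttype": "likes"})

# The page changing later does not touch the tree that was already built.
stale_class = span.get("class")

# Parsing the new source again is the only way to see the change.
fresh_soup = BeautifulSoup(html_after, "html.parser")
fresh_class = fresh_soup.find("span", {"data-sorttype": "likes"}).get("class")

print(stale_class)
print(fresh_class)
```

The first value is None and the second is a list containing 'asc', which is why page_source has to be read and parsed again after every change made in the browser.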
By doing this
browser.execute_script("arguments[0].setAttribute('class', 'asc')", select)
you are trying to use selenium to set an attribute of an object found with beautifulsoup. You cannot do this. The correct way is to find the element with selenium
select = browser.find_element(By.CSS_SELECTOR, '[data-sorttype=likes]')
browser.execute_script("arguments[0].setAttribute('class', 'asc')", select)
Anyway this doesn't work because as I said in the beginning, if you set an attribute with selenium the site will not change.
In this code block
selects = soup.findAll('span', {'data-sorttype': 'likes'})
for select in selects:
    # do something with select
Why loop if you just want to set one attribute? Use soup.find (returns a single bs4 Tag, or None) instead of soup.findAll (returns a list, in this case a list with only one element):
select = soup.find('span', {'data-sorttype': 'likes'})
# do something with select
So the correct sequence of commands to sort the table and scrape it with BeautifulSoup is the following (it needs import time and from selenium.webdriver.common.by import By):
browser.get(url)
# click on the "Likes" button
select = browser.find_element(By.CSS_SELECTOR, '[data-sorttype=likes]')
select.click()
time.sleep(2) # wait to be sure that the table is sorted
html = browser.page_source
soup = BeautifulSoup(html, 'html.parser')
links = soup.findAll('a', {'class': 'card cardtooltip maindeckcard tooltiploaded'})
Note that BeautifulSoup is not mandatory for scraping the page; you can use Selenium for that too.

Extracting the titles of the websites mentioned in the link

https://www.g2.com/categories/marketing-automation
I am trying to scrape the above link, which lists 350+ websites; I need to extract the titles of the websites mentioned.
But I am failing to get any results. I have tried requests with Beautiful Soup,
and then Selenium, and all I am getting is an empty list "[]" or None.
import requests
from bs4 import BeautifulSoup
# Send a GET request to the URL and parse the HTML content
url = 'https://www.g2.com/categories/marketing-automation'
response = requests.get(url).text
soup = BeautifulSoup(response, 'html.parser')
name = soup.find(class_ = "product-card__product-name")
print(name)
The code above is just a test to check whether the data is being pulled, and the response is None.
I expected print to show the element with the class mentioned.

I got this code to return something. I'm still working on it.
from selenium import webdriver
from selenium.webdriver.common.by import By
# Create a new instance of the Chrome driver
driver = webdriver.Chrome()
# Navigate to the webpage
driver.get('https://www.g2.com/categories/marketing-automation')
# Wait for the page to load
driver.implicitly_wait(10)
# Find all the product cards on the page
product_cards = driver.find_elements(By.CLASS_NAME, 'product-card__product-name')
# Iterate over the product cards and extract the title from each one
for product_card in product_cards:
    title = product_card.text
    print(title)
# Close the browser
driver.quit()

Python - Item Price Web Scraping for Target

I'm trying to get any item's price from the Target website. I did some examples for this website using Selenium and the Redsky API, but now I have tried to write the bs4 code below:
import requests
from bs4 import BeautifulSoup
url = "https://www.target.com/p/ruffles-cheddar-38-sour-cream-potato-chips-2-5oz/-/A-14930847#lnk=sametab"
r= requests.get(url)
soup = BeautifulSoup(r.content, "lxml")
price = soup.find("div",class_= "web-migration-tof__PriceFontSize-sc-14z8sos-14 elGGzp")
print(price)
But it returns None.
I also tried soup.find("div",{'class': "web-migration-tof__PriceFontSize-sc-14z8sos-14 elGGzp"})
What am I missing?
I can accept any Selenium or Redsky API code, but my priority is bs4.

The page is dynamic: the data is rendered after the initial request is made, so it is not in the HTML that requests receives. You can use Selenium to load the page, and once it is rendered, pull out the relevant tag. An API, though, is always the preferred way to go if one is available.
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup
# Selenium 4 takes the driver path through a Service object
service = Service('C:/chromedriver_win32/chromedriver.exe')
driver = webdriver.Chrome(service=service)
# If you don't want to open a browser window, comment out the line above and uncomment below
#options = webdriver.ChromeOptions()
#options.add_argument('--headless')
#driver = webdriver.Chrome(service=service, options=options)
url = "https://www.target.com/p/ruffles-cheddar-38-sour-cream-potato-chips-2-5oz/-/A-14930847#lnk=sametab"
driver.get(url)
r = driver.page_source
soup = BeautifulSoup(r, "lxml")
price = soup.find("div",class_= "web-migration-tof__PriceFontSize-sc-14z8sos-14 elGGzp")
print(price.text)
Output:
$1.99

You are simply using the wrong locator.
Try this:
price_css_locator = 'div[data-test=product-price]'
or in XPath style:
price_xpath_locator = '//div[@data-test="product-price"]'
With bs4 it should be something like this:
soup.select('div[data-test="product-price"]')
To get the element's text, use select_one (select returns a list) and add .text:
price = soup.select_one('div[data-test="product-price"]').text
print(price)
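When the element is missing, select_one returns None, so calling .text on the result fails with an AttributeError. A short sketch of a guarded lookup on an invented HTML fragment follows; the markup and the price_text helper name are assumptions for illustration, not Target's real page.

```python
from bs4 import BeautifulSoup

# Invented fragment standing in for the rendered page source.
html = '<div data-test="product-price"><span>$1.99</span></div>'
soup = BeautifulSoup(html, "html.parser")


def price_text(soup, selector='div[data-test="product-price"]'):
    """Return the stripped text of the first match, or None if nothing matches."""
    tag = soup.select_one(selector)
    # select_one gives None when the element is missing, for example when the
    # page was fetched with requests and the price had not been rendered yet.
    return tag.get_text(strip=True) if tag is not None else None


print(price_text(soup))
print(price_text(BeautifulSoup("<div></div>", "html.parser")))
```

Returning None instead of raising makes it easy to tell "wrong selector or unrendered page" apart from a real price.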

use .text
price = soup.find("div",class_= "web-migration-tof__PriceFontSize-sc-14z8sos-14 elGGzp")
print(price.text)

BeautifulSoup find() returns "None" with any name/attributes

I'm trying to get some information about a product I'm interested in on Amazon.
I'm using the BeautifulSoup library for web scraping:
URL = 'https://www.amazon.it/gp/offer-listing/B08KHL2J5X/ref=dp_olp_unknown_mbc'
page = requests.get(URL,headers=headers)
soup = BeautifulSoup(page.content,'html.parser')
title = soup.find('span',class_='a-size-large a-color-price olpOfferPrice a-text-bold')
print(title)
In the pic, the highlighted row is the one I want to select, but when I run my script I get None every time. (Printing the entire output after the BeautifulSoup call gives me the entire HTML source, so I'm using the right URL.)
Any solutions?

You need to use .text (an attribute, not a method) to get the text of an element,
so change:
print(title)
to:
print(title.text)
Output:
EUR 1.153,00

I wouldn't use BS alone in this case. You can easily add Selenium to scrape the website:
from bs4 import BeautifulSoup
from selenium.webdriver.firefox.options import Options as FirefoxOptions
from selenium import webdriver
url = 'https://www.amazon.it/gp/offer-listing/B08KHL2J5X/ref=dp_olp_unknown_mbc'
driver = webdriver.Safari()
driver.get(url)
html_content = driver.page_source
soup = BeautifulSoup(html_content, "html.parser")
title = soup.find('span',class_='a-size-large a-color-price olpOfferPrice a-text-bold')
print(title)
If you can't use Safari, you have to download the webdriver for Chrome, Firefox, etc.; there is plenty of reading material on this topic.

Getting Different Results For Web Scraping

I was trying to do web scraping and was using the following code :
import mechanize
from bs4 import BeautifulSoup
url = "http://www.thehindu.com/archive/web/2010/06/19/"
br = mechanize.Browser()
htmltext = br.open(url).read()
link_dictionary = {}
soup = BeautifulSoup(htmltext)
for tag_li in soup.findAll('li', attrs={"data-section":"Chennai"}):
    for link in tag_li.findAll('a'):
        link_dictionary[link.string] = link.get('href')
        print(link_dictionary[link.string])
        urlnew = link_dictionary[link.string]
        brnew = mechanize.Browser()
        htmltextnew = brnew.open(urlnew).read()
        articletext = ""
        soupnew = BeautifulSoup(htmltextnew)
        for tag in soupnew.findAll('p'):
            articletext += tag.text
        print(articletext)
I was unable to get any printed values by using this. But on using attrs={"data-section":"Business"} instead of attrs={"data-section":"Chennai"} I was able to get the desired output. Can someone help me?

READ THE TERMS OF SERVICE OF THE WEBSITE BEFORE SCRAPING
If you are using Firebug or Inspect Element in Chrome, you might see some content that you will not see when using Mechanize or urllib2.
For example, when you view the source of the page as sent by the server (right click, View Source in Chrome) and search for the data-section attribute, you won't see any tags with the value Chennai. I am not 100% sure, but I would say that content is populated by JavaScript, which requires the functionality of a browser.
If I were you, I would use Selenium to open the page and then get the page source from there; the HTML collected that way will be more like what you see in a browser.
Cited here
from selenium import webdriver
from bs4 import BeautifulSoup
import time
driver = webdriver.Firefox()
driver.get("URL GOES HERE")
# I noticed there is an ad here, so sleep until the page is fully loaded.
time.sleep(10)
soup = BeautifulSoup(driver.page_source, 'html.parser')
print(len(soup.findAll(...)))
# or you can work directly in selenium
...
driver.close()
And the output for me is 8
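The view-source check described above can also be scripted. This is a rough sketch using only the standard library; the AttrFinder class, the raw_html_has helper, and the sample HTML are all made up for illustration. If the attribute value is absent from the raw HTML, the content is probably filled in by JavaScript, and a browser-based tool such as Selenium is needed.

```python
from html.parser import HTMLParser


class AttrFinder(HTMLParser):
    """Count start tags that carry a given attribute/value pair."""

    def __init__(self, attr, value):
        super().__init__()
        self.attr = attr
        self.value = value
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag.
        if (self.attr, self.value) in attrs:
            self.count += 1


def raw_html_has(html, attr, value):
    """Return True if any tag in html has attr == value."""
    finder = AttrFinder(attr, value)
    finder.feed(html)
    return finder.count > 0


# Invented sample: only the "Business" section is present in the raw HTML.
sample = '<ul><li data-section="Business"><a href="/b1">Story</a></li></ul>'
print(raw_html_has(sample, "data-section", "Business"))
print(raw_html_has(sample, "data-section", "Chennai"))
```

In practice the html argument would be the response body fetched without a browser (for example with mechanize or requests), so a False result points to JavaScript-rendered content rather than a wrong selector.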
