Selenium web scraping nested divs with no IDs or class names - Python

I am trying to get the product name and the quantity from a nested HTML table using Selenium. My problem is that some of the divs don't have any id or class names. The table I am trying to access is the Critical Product list. Here is what I have done, but I am lost as to how I can reach the nested divs. The site is in the code below.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import time

options = Options()
options.add_argument('start-maximized')
driver = webdriver.Chrome(chrome_options=options, executable_path=r'/usr/local/bin/chromedriver')
url = 'https://www.rrpcanada.org/#/'  # site I'm scraping
driver.get(url)
time.sleep(150)
page = driver.page_source
driver.quit()
html_soup = BeautifulSoup(page, 'html.parser')
item_containers = html_soup.find_all('div', class_='critical-products-title hide-mobile')
if item_containers:
    for item in item_containers:
        for link in item.find_all('a'):  # need to loop the inner divs to reach the href, then the left and right classes to get title and quantity
            print(link)
Here is the structure I see when inspecting the page. I want to be able to loop through all the divs and get the title and quantity.

You don't need Beautiful Soup, nor do you need to save the page_source.
I used a CSS selector to select all the target rows in the table, then applied a list comprehension to pick the left and right sides of each row, collecting the results into a list of tuples.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time

options = Options()
options.add_argument('start-maximized')
driver = webdriver.Chrome(chrome_options=options, executable_path=r'/usr/local/bin/chromedriver')
url = 'https://www.rrpcanada.org/#/'  # site I'm scraping
driver.get(url)
time.sleep(150)  # the page takes a long time to load its data
elements = driver.find_elements_by_css_selector('#app > div:nth-child(1) > div.header-wrapper > div.header-right > div.critical-product-table-container > div.table.shorten.hide-mobile > div > div > div > a > div')
targetted_values = [(element.find_element_by_css_selector('.line-item-left').text,
                     element.find_element_by_css_selector('.line-item-right').text)
                    for element in elements]
driver.quit()
Example output of targetted_values:
[('Surgical & Reusable Masks', '376,713,363 available'),
('Disposable Gloves', '66,962,093 available'),
('Gowns and Coveralls', '40,502,145 available'),
('Respirators', '22,189,273 available'),
('Surface Wipes', '20,650,831 available'),
('Face Shields', '16,535,686 available'),
('Hand Sanitizer', '11,152,890 L available'),
('Thermometers', '8,457,993 available'),
('Testing Kits', '2,110,815 available'),
('Surface Solutions', '107,452 L available'),
('Protective Barriers', '10,833 available'),
('Ventilators', '410 available')]
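Since targetted_values is a list of (name, quantity) tuples, a quick follow-up sketch if you want lookups by product name:

quantities = dict(targetted_values)
print(quantities['Ventilators'])  # '410 available'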

To print the product name and the quantity you need to induce WebDriverWait for the visibility_of_all_elements_located() and you can use either of the following Locator Strategies:
Using CSS_SELECTOR and text attribute:
driver.get('https://www.rrpcanada.org/#/')
items = [my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div.table.shorten.hide-mobile > div div.line-item-title")))]
quantities = [my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div.table.shorten.hide-mobile > div div.line-item-bold.available")))]
for i,j in zip(items,quantities):
print(i, j)
Using XPATH and get_attribute("innerHTML"):
driver.get('https://www.rrpcanada.org/#/')
items = [my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='table shorten hide-mobile']/div//div[@class='line-item-title']")))]
quantities = [my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='table shorten hide-mobile']/div//div[@class='line-item-bold available']")))]
for i,j in zip(items,quantities):
print(i, j)
Console Output:
Surgical & Reusable Masks 376,713,363 available
Disposable Gloves 66,962,093 available
Gowns and Coveralls 40,502,145 available
Respirators 22,189,273 available
Surface Wipes 20,650,831 available
Face Shields 16,535,686 available
Hand Sanitizer 11,152,890 L available
Thermometers 8,457,993 available
Testing Kits 2,110,815 available
Surface Solutions 107,452 L available
Protective Barriers 10,833 available
Ventilators 410 available
Note : You have to add the following imports :
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
You can find a relevant discussion in How to retrieve the text of a WebElement using Selenium - Python
Outro
Links to useful documentation:
get_attribute() method - Gets the given attribute or property of the element.
text attribute - Returns the text of the element.
Difference between text and innerHTML using Selenium
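As a quick illustration of that difference, a minimal sketch; the markup in the comment is hypothetical:

# Assuming `elem` is a WebElement wrapping the hypothetical markup
# <div>Hand Sanitizer <b>11,152,890 L</b></div>
print(elem.text)                        # rendered text: 'Hand Sanitizer 11,152,890 L'
print(elem.get_attribute("innerHTML"))  # raw markup: 'Hand Sanitizer <b>11,152,890 L</b>'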

You have to locate the element with class="line-item-left" for the name of each item and the element with class="line-item-right" for the number of available items.
driver.find_elements_by_class_name("line-item-left")   # item names
driver.find_elements_by_class_name("line-item-right")  # number of items available
Note the 's' in find_elements.
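Put together, a minimal sketch of that idea, pairing the two lists with zip() (the driver setup is assumed to already exist):

names = driver.find_elements_by_class_name("line-item-left")    # item names
counts = driver.find_elements_by_class_name("line-item-right")  # number available
for name, count in zip(names, counts):
    print(name.text, count.text)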

This is the selector for product name:
div.critical-product-table-container div.line-item-left
And for total:
div.critical-product-table-container div.line-item-right
The following approach works without BeautifulSoup.
time.sleep(...) is bad practice; use WebDriverWait instead.
To pair the two element lists and loop over them in parallel, I use the zip() function:
url = 'https://www.rrpcanada.org/#/' # site I'm scraping
driver.get(url)
wait = WebDriverWait(driver, 150)
product_names = wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'div.critical-product-table-container div.line-item-left')))
totals = wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'div.critical-product-table-container div.line-item-right')))
for product_name, total in zip(product_names, totals):
print(product_name.text + '--' + total.text)
driver.quit()
You need the following imports:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

Related

Selenium: find elements by classname returns only one item instead of all

The code below sometimes returns one (1) element, sometimes all, and sometimes none. For it to work for my application I need it to return all the matching elements in the page
Code trials:
from selenium import webdriver
from selenium.webdriver.common.by import By

def villanovan():
    driver = webdriver.Chrome()
    driver.implicitly_wait(10)
    url = 'http://webcache.googleusercontent.com/search?q=cache:https://villanovan.com/&strip=0&vwsrc=0'
    url_2 = 'https://villanovan.com/'
    driver.get(url_2)
    a = driver.find_elements(By.CLASS_NAME, "homeheadline")
    titles = [i.text for i in a if len(i.text) != 0]
    links = [i.get_attribute('href') for i in a if len(i.text) != 0]
    return [titles, links]

if __name__ == "__main__":
    print(villanovan())
I was expecting a list with multiple links and article titles, but received a list with only the first element found, not all elements found.
To extract the value of href attributes you can use list comprehension and you can use either of the following locator strategies:
Using CSS_SELECTOR:
driver.get("https://villanovan.com/")
time.sleep(3)
print([my_elem.get_attribute("href") for my_elem in driver.find_elements(By.CSS_SELECTOR, "a.homeheadline[href]")])
Using XPATH:
driver.get("https://villanovan.com/")
time.sleep(3)
print([my_elem.get_attribute("href") for my_elem in driver.find_elements(By.XPATH, "//a[@class='homeheadline' and @href]")])
Note : You have to add the following imports :
import time
from selenium.webdriver.common.by import By
Console Output:
['https://villanovan.com/22105/sports/villanova-goes-cold-in-clutch-against-no-14-marquette/', 'https://villanovan.com/22102/sports/villanova-bests-marquette-in-blowout-win-73-54/', 'https://villanovan.com/22098/news/decarbonizing-villanova-a-town-hall-on-fossil-fuel-divestment/', 'https://villanovan.com/22096/news/biology-professors-granted-1-million-for-wetlands-research/', 'https://villanovan.com/22093/news/students-create-the-space-supporting-sex-education/', 'https://villanovan.com/22098/news/decarbonizing-villanova-a-town-hall-on-fossil-fuel-divestment/', 'https://villanovan.com/22096/news/biology-professors-granted-1-million-for-wetlands-research/', 'https://villanovan.com/22044/culture/julia-staniscis-leaning-on-letters/', 'https://villanovan.com/22032/culture/villanova-sorority-recruitment-recap/', 'https://villanovan.com/22105/sports/villanova-goes-cold-in-clutch-against-no-14-marquette/', 'https://villanovan.com/22102/sports/villanova-bests-marquette-in-blowout-win-73-54/', 'https://villanovan.com/21932/opinion/villanova-should-be-free-for-families-earning-less-than-100000/', 'https://villanovan.com/21897/opinion/grasshoppergate-the-state-of-villanova-dining/', 'https://villanovan.com/22105/sports/villanova-goes-cold-in-clutch-against-no-14-marquette/', 'https://villanovan.com/22102/sports/villanova-bests-marquette-in-blowout-win-73-54/', 'https://villanovan.com/22093/news/students-create-the-space-supporting-sex-education/', 'https://villanovan.com/22090/news/mlk-day-of-service/', 'https://villanovan.com/22087/news/university-updates-covid-procedures/']

How to scrape multiple data points that pop up after mouseover event?

I am trying to scrape data from the following website:
https://prisjagt.dk/lyd-billede/horetelefoner-tilbehor/hovedtelefoner/apple-airpods-pro-2nd-generation-2022--p7054034
I would like to scrape data from the graph in the upper right called "Prishistorik", but the data only appears when hovering the mouse over specific points on the graph.
Using the code below, I managed to get one output. However, it seems that every point on the chart has the same xpath, so how can I scrape all the different dates and corresponding prices across the chart?
Thanks in advance for any help!
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver import ActionChains
from selenium import webdriver
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("start-maximized")
chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
chrome_options.add_experimental_option('useAutomationExtension', False)
DRIVER_PATH = r'd:\8278\Downloads\chromedriver.exe'
browser = webdriver.Chrome(options=chrome_options, executable_path=DRIVER_PATH)
browser.get("https://prisjagt.dk/lyd-billede/horetelefoner-tilbehor/hovedtelefoner/apple-airpods-pro-2nd-generation-2022--p7054034")
browser.maximize_window()
explicit_wait20 = WebDriverWait(browser, 20)
try:
    prices = explicit_wait20.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, '#root > div > section > div.Content-sc-2fu3f8-2.hybPGh > div.PageContent-sc-1wgu331-5.fbKCSg > div > div > div > div:nth-child(4) > div.StyledViewport-sc-7zjdbj-0.iDfoDl > header > div.ProductSummary-sc-16x82tr-1.cfOOAp > div.PriceHistoryLinkWrapper-sc-yn9z6-0.iQTEhC > a > div > div > div.PriceHistoryWrapper-sc-1yjg8cb-3.ejhsba > div > svg > g:nth-child(3) > rect')))
except TimeoutException:
    browser.refresh()
data = []
for p in prices:
    # Execute mouseover on the element
    ActionChains(browser).move_to_element(p).perform()
    mouseover = WebDriverWait(browser, 5).until(EC.visibility_of_element_located((By.XPATH, '//*[@id="root"]/div/section/div[2]/div[2]/div/div/div/div[3]/div[1]/header/div[4]/div[1]/a/div/div/div[2]/div/div')))
    data.append(mouseover.text)
print(data[0])
In your case you don't even need Selenium to get the price history. You can use requests: calling the site's GraphQL endpoint with the right parameters returns the price history under data['product']['statistics']['nodes'].
See the example call below.
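A minimal sketch of that idea. The endpoint URL and query payload below are placeholders you must copy from your browser's DevTools (Network tab); only the response path comes from the text above:

import requests

url = 'https://prisjagt.dk/graphql'  # assumed endpoint - copy the real one from DevTools
payload = {'query': '...', 'variables': {}}  # placeholder - copy the real payload from DevTools

data = requests.post(url, json=payload).json()['data']
for node in data['product']['statistics']['nodes']:  # path given in the answer above
    print(node)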

How to click on every title within the page to scrape the title

For example this is the main page link
https://www.nationalhardwareshow.com/en-us/attend/exhibitor-list.html
Go to that page, then click on the first title, e.g. 10X Innovations - Swift ULV, and get the title.
This is my code:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium import webdriver
from time import sleep
PATH="C:\Program Files (x86)\chromedriver.exe"
driver =webdriver.Chrome(PATH)
driver.get('https://www.nationalhardwareshow.com/en-us/attend/exhibitor-list.html')
data = []
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button#onetrust-accept-btn-handler"))).click()
hrefs = [my_elem.get_attribute("href") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//h3[#class='text-center-mobile wrap-word']//ancestor::a[1]")))[:5]]
windows_before = driver.current_window_handle
for href in hrefs:
driver.execute_script("window.open('" + href +"');")
WebDriverWait(driver, 10).until(EC.number_of_windows_to_be(2))
windows_after = driver.window_handles
new_window = [x for x in windows_after if x != windows_before][0]
driver.switch_to.window(new_window)
data.append(WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.XPATH, "//h1[#class='wrap-word']"))).text)
driver.close()
driver.switch_to.window(windows_before)
print(data)
Your current problem is an invalid XPath //div[contains(@class,'company-info']//h3)] - wrong parentheses usage. You need to use //div[contains(@class,'company-info')]//h3 instead.
However, if you want to scrape data from each company entry on the page, then your link-clicking approach is not a good fit.
Try to implement the following (a sketch follows after this list):
- Get the href attribute of every link. Since not all links are initially displayed on the page, you need to trigger all possible XHRs: create a count variable holding the current link count and, in a while loop:
- send the END key to scroll the page down
- try to wait until the current link count > count. If it grows, re-define count with the new value. On a timeout, break the loop (no more links remain to load)
- Get the href of all link nodes //div[@class="company-info"]//a
- In a for loop, navigate to each link with driver.get(<URL>)
- Scrape the data
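A minimal sketch of that loop, assuming the driver, WebDriverWait and By from the question's code above, and that the link nodes match the XPath just given:

from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import TimeoutException

LINKS = '//div[@class="company-info"]//a'
wait = WebDriverWait(driver, 10)
count = len(driver.find_elements(By.XPATH, LINKS))
while True:
    driver.find_element(By.TAG_NAME, 'body').send_keys(Keys.END)  # scroll down to trigger the next XHR
    try:
        wait.until(lambda d: len(d.find_elements(By.XPATH, LINKS)) > count)
        count = len(driver.find_elements(By.XPATH, LINKS))  # more links loaded, keep going
    except TimeoutException:
        break  # no more links remain to load
hrefs = [a.get_attribute('href') for a in driver.find_elements(By.XPATH, LINKS)]
for href in hrefs:
    driver.get(href)
    # scrape the company data here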
Within the 2022 EXHIBITOR LIST webpage, to click() on each link and scrape it, you can collect the href attributes and open each in an adjacent tab as follows:
Code Block (sample for the first 5 entries):
driver.get('https://www.nationalhardwareshow.com/en-us/attend/exhibitor-list.html')
data = []
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button#onetrust-accept-btn-handler"))).click()
hrefs = [my_elem.get_attribute("href") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//h3[@class='text-center-mobile wrap-word']//ancestor::a[1]")))[:5]]
windows_before = driver.current_window_handle
for href in hrefs:
    driver.execute_script("window.open('" + href + "');")
    WebDriverWait(driver, 20).until(EC.number_of_windows_to_be(2))
    windows_after = driver.window_handles
    new_window = [x for x in windows_after if x != windows_before][0]
    driver.switch_to.window(new_window)
    data.append(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//h1[@class='wrap-word']"))).text)
    driver.close()
    driver.switch_to.window(windows_before)
print(data)
Console Output:
['10X Innovations - Swift ULV', '21st Century Inc', '3V Snap Ring LLC.', 'A-ipower Corp', 'A.A.C. Forearm Forklift Inc']

How to get the style value in a div tag in Python?

I want to crawl the images in a single webpage, and the image URLs live inside a div tag's style attribute, like so:
<div class="v-image__image v-image__image--cover" style="background-image: url("https://mashinbank.com/api/parse/files/7uPtEVa0plEFoNExiYHcbtL1rQnpIGnnPHVuvKKu/f3ea4910e239eb704af755c65f548e35_car.png"); background-position: center center;"></div>
I want to get that: https://mashinbank.com/api/parse/files/7uPtEVa0plEFoNExiYHcbtL1rQnpIGnnPHVuvKKu/f3ea4910e239eb704af755c65f548e35_car.png
But when I try Chrome driver find_elements or soup.find, they return empty results, because there is no text between the div tags.
I'm looking for a way to get at the value inside the div tag's attribute, not the text between the tags.
To get all the images you should wait for the presence of all the elements.
Once you have the list in Python, such as all_images (see below), you can strip the () and "" as shown.
Sample code :
driver = webdriver.Chrome(driver_path)
driver.maximize_window()
#driver.implicitly_wait(50)
wait = WebDriverWait(driver, 20)
links = []
driver.get("https://mashinbank.com/ad/GkbI20tzp3/%D8%AE%D8%B1%DB%8C%D8%AF-%D9%BE%D8%B1%D8%A7%DB%8C%D8%AF-111-SE-1397")
all_images = WebDriverWait(driver, 20).until(EC.presence_of_all_elements_located((By.XPATH, "//div[contains(@class,'v-image__image--cover')]")))
for image in all_images:
    a = image.get_attribute('style')
    b = a.split("(")[1].split(")")[0].replace('"', '')
    links.append(b)
print(links)
Imports :
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
Output :
['https://mashinbank.com/api/parse/files/7uPtEVa0plEFoNExiYHcbtL1rQnpIGnnPHVuvKKu/cabdf9f3f379e5b839300f89a90ab27e_car.png', 'https://mashinbank.com/api/parse/files/7uPtEVa0plEFoNExiYHcbtL1rQnpIGnnPHVuvKKu/e1c6c75dda980a6b4b4a83932ed49832_car.png', 'https://mashinbank.com/api/parse/files/7uPtEVa0plEFoNExiYHcbtL1rQnpIGnnPHVuvKKu/81ef7c57ca349485a9ba78bf0e42e13f_car.png', 'https://mashinbank.com/api/parse/files/7uPtEVa0plEFoNExiYHcbtL1rQnpIGnnPHVuvKKu/02bd13f2c5ce936ec3db10706c03854d_car.png', 'https://mashinbank.com/api/parse/files/7uPtEVa0plEFoNExiYHcbtL1rQnpIGnnPHVuvKKu/cabdf9f3f379e5b839300f89a90ab27e_car.png', 'https://mashinbank.com/api/parse/files/7uPtEVa0plEFoNExiYHcbtL1rQnpIGnnPHVuvKKu/e1c6c75dda980a6b4b4a83932ed49832_car.png', 'https://mashinbank.com/api/parse/files/7uPtEVa0plEFoNExiYHcbtL1rQnpIGnnPHVuvKKu/81ef7c57ca349485a9ba78bf0e42e13f_car.png', 'https://mashinbank.com/api/parse/files/7uPtEVa0plEFoNExiYHcbtL1rQnpIGnnPHVuvKKu/02bd13f2c5ce936ec3db10706c03854d_car.png']
A Selenium solution for this issue can be as follows:
You should wait for the element's visibility and only after that extract the element attribute.
Then split the style attribute value to get the url value.
Like this:
wait = WebDriverWait(driver, 20)
element = wait.until(EC.visibility_of_element_located((By.XPATH, "//div[contains(@style,'https://mashinbank.com/api/parse/files')]")))
style_content = element.get_attribute("style")
url = style_content.split("(")[1].split(")")[0].replace('"', '')
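If the formatting of the style value varies (quoted vs. unquoted url(...)), a regular expression is a more robust way to pull the URL out; a small sketch:

import re

style_content = element.get_attribute('style')
match = re.search(r'url\("?([^")]+)"?\)', style_content)  # tolerates url(...) with or without quotes
if match:
    url = match.group(1)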

Python Selenium select element a few questions

I have two questions.
I'm trying to get the text values (Medium Coverage, Liquid Formula), but there are no obvious locators. These text values differ from page to page but the locations are the same. Is there a way to get this text like:
find_element_by_class('css_s6sd4y.eanm7710') -> go down one more row, then select the element
There is a class name, but I wonder if there is another way to find the element, using data-at = 'sku_size_label'. And what is the 'data-at' locator called?
For the first question, you can use the below code :
To get the Medium Coverage :
wait = WebDriverWait(driver, 10)
ele = wait.until(EC.visibility_of_element_located((By.XPATH, "//img[@alt='Medium Coverage']/.."))).text
print(ele)
or
wait = WebDriverWait(driver, 10)
ele = wait.until(EC.visibility_of_element_located((By.XPATH, "//img[@alt='Medium Coverage']/.."))).get_attribute('innerHTML')
print(ele)
To get the Liquid Formula :
wait = WebDriverWait(driver, 10)
ele = wait.until(EC.visibility_of_element_located((By.XPATH, "//img[@alt='Liquid Formula']/.."))).text
print(ele)
or
wait = WebDriverWait(driver, 10)
ele = wait.until(EC.visibility_of_element_located((By.XPATH, "//img[@alt='Liquid Formula']/.."))).get_attribute('innerHTML')
print(ele)
For your second question :
Yes, you can use data-at = 'sku_size_label' in XPath or CSS :
XPath below :
//span[contains(@data-at, 'sku_size_label')]
CSS below :
span[data-at = 'sku_size_label']
And for this question :
what is the 'data-at' locator called? - They are not called locators; they are simply attributes of the respective tag.
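For example, a short sketch using that attribute with an explicit wait (the driver setup is assumed to already exist):

wait = WebDriverWait(driver, 10)
# CSS: match the span by its data-at attribute
size_label = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "span[data-at='sku_size_label']"))).text
print(size_label)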
Imports :
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
