Choosing appropriate locators when scraping dynamic content with Python and Selenium

I am trying to understand the correct way to select specific elements of a webpage using Python and Selenium. I am uncertain what dictates which approach to take, such as XPath or CSS selectors.
https://dutchie.com/embedded-menu/revolutionary-clinics-somerville/menu
<a class="consumer-product-card__StyledLink-ncbvk2-1 jpGhIo" href="/embedded-menu/berkshire-roots/menu/cbd-tincture-2-1-225mg">
<span>CBD Tincture 2:1 225mg Details</span>
<div class="product-card__Container-sc-7s6mw-0 iWHVJj">
<div class="product-card__Content-sc-7s6mw-1 cfcIOW">
<div class="product-information__Container-sc-65h5ke-0 ejVwks">
<img class="product-information__StyledProductImage-sc-65h5ke-1 jupjtQ" width="218" height="218" src="https://images.dutchie.com/0f253b35120facc1465b75b08bfd4d66?auto=format&dpr=1&bg=FFFFFF&crop=faces&fit=fill&w=218&h=218&ixlib=react-7.2.0" alt="" srcset="https://images.dutchie.com/0f253b35120facc1465b75b08bfd4d66?auto=format&dpr=2&bg=FFFFFF&crop=faces&fit=fill&w=218&h=218&ixlib=react-7.2.0 2x, https://images.dutchie.com/0f253b35120facc1465b75b08bfd4d66?auto=format&dpr=3&bg=FFFFFF&crop=faces&fit=fill&w=218&h=218&ixlib=react-7.2.0 3x">
<div class="product-information__ProductInfo-sc-65h5ke-2 bwhblJ">
<div class="product-information__Price-sc-65h5ke-7 eEqLUB">$36.95</div>
<div class="product-information__BrandContainer-sc-65h5ke-5 dlSlvE list-only">
<div class="product-information__Brand-sc-65h5ke-6 ftehWE">Berkshire Roots</div>
</div>
<div class="product-information__TitleContainer-sc-65h5ke-3 fOoVwz list-only false">
<div class="product-information__Title-sc-65h5ke-4 eBIyJW --line2">CBD Tincture 2:1 225mg</div>
</div>
<div class="product-information__TitleContainer-sc-65h5ke-3 fOoVwz mobile-and-card">
<div class="product-information__Title-sc-65h5ke-4 eBIyJW">CBD Tincture 2:1</div>
<div class="product-information__Title-sc-65h5ke-4 eBIyJW --line2"> 225mg</div>
</div>
<div class="product-information__DetailsContainer-sc-65h5ke-9 ifqkuO">
<div class="product-information__Strain-sc-65h5ke-10 eWkod --high-cbd">High CBD</div>
<div class="product-information__PotencyInfo-sc-65h5ke-14 gUReQf"><b>THC: </b>72.3 mg | <b>CBD: </b>160.3 mg</div>
</div>
</div>
</div>
<div class="product-weights__Container-nwgli1-0 gwUwAi">
<div class="product-weights__Weights-nwgli1-1 kiObrJ">
<div aria-label="Add 0.41g to cart for $36.95" data-cy="product-card-weight" class="weight__Container-sc-11f1l3-2 dNvnhd">
<div class="weight__Price-sc-11f1l3-4 ZtHqz">$36.95</div>
<div class="weight__IconContainer-sc-11f1l3-1 zqIJt">
<svg xmlns="http://www.w3.org/2000/svg" width="11" height="11" viewBox="0 0 10 10">
<path fill="#A6ACB3" fill-rule="nonzero" d="M9.176 5c0-.407-.031-.723-.438-.723l-3.022.007.007-3.022c0-.407-.326-.428-.722-.438-.407 0-.723.03-.722.436l.003 3.012-3.022.007c-.406 0-.426.325-.436.722-.01.396.031.722.438.722l3.022-.007.003 3.012c0 .407.326.427.723.438.407 0 .722-.03.721-.437l-.003-3.011 3.012.003c.406 0 .437-.315.436-.722z"></path>
</svg>
</div>
</div>
<div class="product-weights__Fill-nwgli1-2 dtfdkt"></div>
</div>
</div>
</div>
</div>
</a>
How would I use a loop of sorts to access each and every "consumer-product-card" without having scrolled to the bottom of the page? Or would I need to force the page to scroll first? Is the "consumer-product-card" approach correct, or would XPath make more sense? With either, I find it difficult to understand which is ideal and why, or even how to select one card, then the next, and so on until I reach the end.
Thank you.

To find all cards use:
driver.find_elements_by_xpath("//div[contains(@class,'consumer-product-card__InViewContainer-ncbvk2-0 dWfGpk')]")
Then use the links I gave you in the previous question as examples.
UPDATE
Solution to start with:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome(executable_path='/snap/bin/chromium.chromedriver')
driver.get('https://dutchie.com/embedded-menu/revolutionary-clinics-somerville/menu')
wait = WebDriverWait(driver, 30)
wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".product-card__Content-sc-7s6mw-1.cfcIOW")))
cards = driver.find_elements_by_css_selector(".product-card__Content-sc-7s6mw-1.cfcIOW")
data = []
for card in cards:
    name = card.find_element_by_css_selector(".product-information__TitleContainer-sc-65h5ke-3.fOoVwz.list-only").text
    data.append(name)
for i in data:
    print(i)
It waits for the cards and prints their names. Scrolling and extracting other elements are separate questions.
I found CSS selectors more suitable for this case.
The result is three items:
Rick Simpson Oil (RSO)
Live Sugar - Purple Pineapple Express
Live Sugar - Gelato #33
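On the scrolling part of the question: the run above returned only three items, which suggests the menu may load more cards as the page scrolls. Below is a minimal sketch, assuming that scrolling the window to the bottom triggers further loading (the menu could instead scroll inside its own container, which this does not handle). The helper name scroll_until_stable and its parameters are hypothetical; "css selector" is the string value of By.CSS_SELECTOR.

```python
import time


def scroll_until_stable(driver, css_selector, pause=1.0, max_rounds=30):
    """Scroll to the bottom repeatedly until the number of matches stops growing.

    Hypothetical helper: `driver` is a Selenium WebDriver (or anything exposing
    execute_script and find_elements with the same signatures).
    """
    seen = 0
    for _ in range(max_rounds):
        # Ask the browser to jump to the bottom of the document.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # Give lazy-loaded content a moment to render.
        time.sleep(pause)
        count = len(driver.find_elements("css selector", css_selector))
        if count == seen:
            # No new elements appeared after this scroll, so stop.
            break
        seen = count
    return driver.find_elements("css selector", css_selector)
```

With a live driver this could be called as cards = scroll_until_stable(driver, ".product-card__Content-sc-7s6mw-1.cfcIOW") and followed by the same loop over cards as above. An explicit wait is usually preferable to the fixed pause when there is a reliable "loading finished" signal to wait on.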

This is kind of an opinionated question.
I would likely use the simplest CSS Selector I can find that uniquely defines the element. XPath is slower and, I find, likely more brittle and harder to find good selectors for elements. But there is no "correct" approach.
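As a rough illustration of "simplest selector that uniquely defines the element" for the HTML in the question: the class suffixes such as sc-65h5ke-7 eEqLUB look auto-generated and may change between site builds (an assumption, not something the page confirms), so matching on the readable part of the class name or on the data-cy attribute is likely to be less brittle. The helper class_contains and the constant names are hypothetical.

```python
def class_contains(tag, fragment):
    """Build a CSS selector for <tag> elements whose class attribute contains `fragment`."""
    return f"{tag}[class*='{fragment}']"


# Selectors built from the stable-looking parts of the markup in the question.
CARD_LINK = class_contains("a", "consumer-product-card__StyledLink")
PRICE = class_contains("div", "product-information__Price")
WEIGHT = "[data-cy='product-card-weight']"

# Usage with a live driver (not executed here):
# for link in driver.find_elements("css selector", CARD_LINK):
#     print(link.find_element("css selector", PRICE).text)

print(CARD_LINK)
print(PRICE)
```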
I'm a little confused regarding the goal of the rest of the question. I think we would need some more detail and the code you've used to attempt this.
Also, your HTML is formatted on one line and very hard to view.

Related

Finding related fields / iterating through Selenium browser results

I am trying to iterate through a set of results similar to the HTML below. To select them I use:
for a in browser.find_elements_by_css_selector(".inner-row"):
What I then want to do is return:
a. The x in the class next to time-x (e.g. 26940 in the example)
b. filter to bananas only
c. Grab the suffix to "row-x" in the id
For each result. I can then iterate through the results for each of these that meets the parameters.
I have tried the get attribute function but this doesn't return any results, and .text is out of the question due to no real information between the tags.
<div id="bookingResults bookingGroup-111">
<div id="row-1522076067"
class="row row-time group-111 time-26940 amOnly bananas groupOnly rule-1252"
style="display: block;">
<div class="lockOverlay lock-row-124" style="display: none;"><div class="lockInfoCont"><p class="lockedText">Locked <span class="miclub-icon icon-lock"></span></p></div><div class="lockTimer"></div></div>
<div class="col-lg-3 col-md-4 col-sm-4 col-xs-4 row-heading " id="heading-1522076067" >
<div class="row">
<div class="col-lg-4 col-md-4 col-sm-5 col-xs-5 row-heading-inner">
<h3>07:29 am</h3>
<h4>
Choose Me
<br/>
<span id="rule-name-row-1522076067" style="display: none">
</span>
</h4>
</div>
<div class="col-lg-8 col-md-8 col-sm-7 col-xs-7 row-heading-inner">
<button id="btn-book-group-1522076067"
class="btn btn-book-group hide"
title="Book Row" >
<span class="btn-label">BOOK GROUP</span>
</button>
<div class="row-information">
</div>
</div>
</div>
</div>

I assume this HTML is what each a in your loop contains. You can then extract the id and time with the following code:
from selenium.common.exceptions import NoSuchElementException

for a in browser.find_elements_by_css_selector(".inner-row"):
    try:
        # Only rows that carry the "bananas" class are of interest
        el = a.find_element_by_css_selector("div.bananas")
        print("id: %s" % el.get_attribute("id").split("-")[1])
        print("time: %s" % [s for s in el.get_attribute("class").split(" ") if "time-" in s][0].split("time-")[1])
    except NoSuchElementException:
        pass

You can get the ids like so
print([e.get_attribute('id') for e in driver.find_elements(By.CSS_SELECTOR, 'div.bananas')])
Prints ['row-1522076067']

To handle dynamic elements, use WebDriverWait() with visibility_of_all_elements_located() and the following CSS selector.
Then use a regular expression to extract the values from the element attributes.
Code
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import re
driver=webdriver.Chrome()
driver.get("URL here")
elements=WebDriverWait(driver,10).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR,"div[id^='bookingResults']>div.bananas")))
for element in elements:
    print(re.findall(r"row-(\d+)", element.get_attribute("id"))[0])
    classatr = element.get_attribute("class")
    print(re.findall(r"time-(\d+)", classatr)[0])
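The attribute parsing in these answers can be checked without a browser. Here is a small sketch that takes the id and class strings from the question's HTML and pulls out the three requested values; parse_row is a hypothetical helper name, and with Selenium the two arguments would come from element.get_attribute("id") and element.get_attribute("class").

```python
import re


def parse_row(id_attr, class_attr):
    """Extract the row suffix, the time-x value and the bananas flag from attribute strings."""
    classes = class_attr.split()
    row = re.fullmatch(r"row-(\d+)", id_attr)
    # Match the whole class token so that "row-time" is not mistaken for "time-x".
    time_cls = next((c for c in classes if re.fullmatch(r"time-\d+", c)), None)
    return {
        "row": row.group(1) if row else None,
        "time": time_cls.split("-", 1)[1] if time_cls else None,
        "bananas": "bananas" in classes,
    }


print(parse_row(
    "row-1522076067",
    "row row-time group-111 time-26940 amOnly bananas groupOnly rule-1252",
))
# prints {'row': '1522076067', 'time': '26940', 'bananas': True}
```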

Selenium's finding methods by class and tag show no elements present

I'm trying to extract the preferences from my Netflix account with Selenium. Using find_elements_by_class_name I managed to log in, choose a profile, open the account page and change the list from views to ratings, but I can't figure out how to select the movies from the table, since that function doesn't return any results when used with their class or tag names.
This is the code I've written so far, and I've only got problems with the last line:
from getpass import getpass
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys
from selenium import webdriver
ch = Options()
ch.add_argument("--disable-extensions")
ch.add_argument("--disable-gpu")
ch.add_argument("--incognito")
browser = webdriver.Chrome(options = ch)
browser.get("https://www.netflix.com/login")
username = browser.find_element_by_id("id_userLoginId")
password = browser.find_element_by_id("id_password")
username.send_keys(input('Insert e-mail: '))
password.send_keys(getpass(prompt = "Insert password: "))
password.send_keys(Keys.ENTER)
profiles = browser.find_elements_by_class_name("profile-name")
print(profiles)
profiles[0].click()
browser.get("https://www.netflix.com/viewingactivity")
browser.find_element_by_class_name("choice.icon.rating").click()
print(browser.find_elements_by_class_name("retableRow"))
The HTML code I'm referring to is (sorry for the awful formatting):
<ul class="structural retable stdHeight">
<li class="retableRow">
<div class="col date nowrap">05/09/19
</div>
<div class="col title">
Watchmen</div><div class="col rating nowrap"><div class="thumbs-component thumbs thumbs-horizontal rated rated-up" data-uia="thumbs-container">
<div class="nf-svg-button-wrapper thumb-container thumb-up-container " data-uia="">
<a role="link" data-rating="0" tabindex="0" class="nf-svg-button simpleround" aria-label="Già valutato: pollice alzato (fai clic per rimuovere la valutazione)">
<svg data-rating="0" class="svg-icon svg-icon-thumb-up-filled" focusable="true">
<use filter="" xlink:href="#thumb-up-filled"></use></svg></a></div><div class="nf-svg-button-wrapper thumb-container thumb-down-container " data-uia="">
<a role="link" data-rating="1" tabindex="0" class="nf-svg-button simpleround" aria-label="Valutazione pollice verso">
<svg data-rating="1" class="svg-icon svg-icon-thumb-down" focusable="true"><use filter="" xlink:href="#thumb-down">
</use>
</svg>
</a>
</div>
<div class="nf-svg-button-wrapper clear-rating" data-uia="">
<a role="link" data-rating="0" data-clear-thumbs="true" tabindex="0" class="nf-svg-button simpleround" aria-label="Rimuovi la valutazione">
<svg data-rating="0" data-clear-thumbs="true" class="svg-icon svg-icon-close" focusable="true">
<use filter="" xlink:href="#close">
</use>
</svg>
</a>
</div>
</div>
</div>
</li>
It should print a list of all the elements of the class "retableRow", but it prints an empty list instead. I've tried the class "col.title" with similar results, and the tag "li", which gave me totally different elements I'm not interested in. What am I doing wrong?

You are trying to find elements which are not there yet. Probably the page is updated via ajax calls or something.
import time
browser.find_element_by_class_name("choice.icon.rating").click()
time.sleep(1)
print(browser.find_elements_by_class_name("retableRow"))
Ta-daam. Wait for it.
A bit more elegant approach would be to wait for element presence and then start parsing.
Example:
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.common.by import By
def wait_for_elem_by_xpath(xp):
    # Blocks for up to 20 seconds and returns the first matching element
    elem = WebDriverWait(browser, 20).until(EC.presence_of_element_located((By.XPATH, xp)))
    return elem
With this, replace the last line of your sample code with:
wait_for_elem_by_xpath('//*[@class="retableRow"]')
your_list = browser.find_elements_by_class_name("retableRow")
print(your_list)
And it will work.
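To make clear what such a wait does under the hood, here is a dependency-free sketch of the polling idea: look the elements up repeatedly until at least one is found or a deadline passes. This is a hand-rolled stand-in for illustration, not Selenium's implementation; in real code WebDriverWait with EC.presence_of_all_elements_located does this job. The name wait_for_all and its parameters are hypothetical, and "css selector" is the string value of By.CSS_SELECTOR.

```python
import time


def wait_for_all(driver, css_selector, timeout=20.0, poll=0.5):
    """Poll until at least one element matches, then return all matches.

    Raises TimeoutError if nothing matches before `timeout` seconds elapse.
    """
    deadline = time.monotonic() + timeout
    while True:
        found = driver.find_elements("css selector", css_selector)
        if found:
            return found
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no elements matched {css_selector!r}")
        # Sleep briefly so the loop does not hammer the browser.
        time.sleep(poll)
```

For the page in the question, a call might look like rows = wait_for_all(browser, "li.retableRow").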

Click ComboButton item with Selenium

I am trying to perform a simple click, but I cannot work out how to locate the element because of the type of element it is.
<div class="active">
<div class="action-title">Reconcile All</div>
<div class="action-description">Reconcile all IPv4 addresses</div>
</div>
<div class="active">
<img src="/images/icons/small/checks.gif" border="0">
</div>
I have tried several approaches, such as:
driver.find_elements_by_link_text("Reconcile All").click()
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.LINK_TEXT, "Reconcile All"))).click()
I even tried locating it by the icon:
driver.find_element_by_xpath("//*[contains(@src,'/images/icons/small/checks.gif')]").click()
Thanks in advance for any help

A div element can't be located by link text, since link text only matches anchor (a) elements. Try the following XPath with WebDriverWait to click it.
WebDriverWait(driver,20).until(EC.element_to_be_clickable((By.XPATH,"//div[@class='active']//div[@class='action-title'][contains(.,'Reconcile All')]"))).click()

Get link when clicked in an image

This is the image where, when clicked, user is redirected to another page.
<div class="lis_el " id="cel_lisimg_18755" onclick="lis_mostrarficha(0);">
<div class="lis_elc ">
<div class="lis_eloverflow">
<div class="lis_elc_img">
<div class="lis_elc_imgc"><img class="lis_elc_img_img" id="lisimg_18755" src="https://sgfm.elcorteingles.es/SGFM/dctm/MEDIA03/201705/29/0280282401564764342_1_.jpg">
</div>
</div>
</div>
<div class="lis_info ">
<div class="clear"></div>
<div class="lis_info_precio">
5<span class="lis_info_preciop">,99€</span>
</div>
<h2>Camiseta flame</h2>
<div class="lis_mascol displaynone" id="lis_mascol18755" style="display: block;">+ Colores</div>
</div>
</div>
</div>
I'm trying to obtain that link using Selenium in Python, but I don't know where to get it from. I did notice this attribute, which I suppose is the function that does the redirection:
onclick="lis_mostrarficha(0);"
I don't have much experience in web development, so I'm not sure how I can obtain that link without clicking, as clicking would take too long.
Thanks,

You will have to perform the click event in this case because the HTML does not contain the URL linked to the image -- it calls a script. What can be done is to use Selenium to click the element that contains the onclick event.
from selenium.webdriver.chrome.options import Options
from selenium import webdriver
options = Options()
options.add_argument('--disable-infobars')
driver = webdriver.Chrome(options=options)
# Navigate to the listing page with driver.get(...) before locating the element
div = driver.find_element_by_id('cel_lisimg_18755')
div.click()
# Then wait for the page to load
# Get the URL
url = driver.current_url
print(url) # Assumes Python 3
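The "wait for the page to load" step above is left as a comment. One way to fill it in, sketched under the assumption that the click navigates in the same tab and changes the URL, is to poll driver.current_url until it differs from the value recorded before the click. The function name url_after_click and its parameters are hypothetical.

```python
import time


def url_after_click(driver, element, timeout=10.0, poll=0.2):
    """Click `element` and return the new URL once driver.current_url has changed.

    Raises TimeoutError if the URL is still unchanged after `timeout` seconds.
    """
    before = driver.current_url
    element.click()
    deadline = time.monotonic() + timeout
    while driver.current_url == before:
        if time.monotonic() >= deadline:
            raise TimeoutError("URL did not change after the click")
        time.sleep(poll)
    return driver.current_url
```

After collecting the URL, driver.back() returns to the listing so the next item can be clicked; elements located before the navigation will generally need to be located again.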

Selenium wait for element to be clickable python

All, I'm needing a little assistance with Selenium waits. I can't seem to figure out how to wait for an element to be ready.
The element that I need to wait for I can locate and click using my script via the code below...
CreateJob = driver.find_element_by_xpath(".//*[@id='line']/div[1]/a")
or
CreateJob = driver.find_element_by_partial_link_text("Create Activity")
I'm needing to wait for this element to be on the page and clickable before I try to click on the element.
I can use the sleep command, but I have to wait for 5 seconds or more and it seems to be unreliable and errors out 1 out of 8 times or so.
I can't seem to find the correct syntax to use.
the HTML code for this is below.
<document>
<html manifest="https://tddf/index.php?m=manifest&a=index">
<head>
<body class="my-own-class mozilla mozilla48 mq1280 lt1440 lt1680 lt1920 themered" touch-device="not">
<noscript style="text-align: center; display: block;">Please enable JavaScript in your browser settings.</noscript>
<div id="wait" style="display: none;">
<div id="processing" class="hidden" style="display: none;"/>
<div id="loading" class="hidden" style="display: none;"/>
<div id="loadingPartsCatalog" class="hidden"/>
<div id="panel">
<div id="top-toolbar" class="hidden" style="display: block;">
<div id="commands-line" class="hidden" style="display: block;">
<div id="line">
<div class="action-link">
<a class="tap-active" href="#m=activity/a=set" action_link_label="create_activity" component_gui="action" component_type="action">Create Activity</a>
</div>
<div class="action-link">
<div class="action-link">
<div class="action-link">
</div>
<div id="commands-more" style="display: none;">
<div id="commands-list" class="hidden">
</div>
<div id="provider-search-bar" class="hidden center"

Here is a link to the 'waiting' section of the Python Selenium docs: http://selenium-python.readthedocs.io/waits.html#explicit-waits
Your wait should look like this:
element = WebDriverWait(driver, 10).until(
    EC.visibility_of_element_located((By.XPATH, ".//*[@id='line']/div[1]/a"))
)

I find this to be the easiest:
driver.implicitly_wait(10)
This makes every element lookup retry for up to 10 seconds before raising an error if the element is not found. I find it simpler than always checking for the visibility, the clickability, or some other state of the element. It is less precise and more error-prone, however, since it only waits for the element to exist, so the right choice depends on why you use Selenium.
It also lets me cut down on try/except statements in my Selenium scripts, and since I found out about it I have removed many time.sleep() calls as well.
