I'm trying to get the text inside an <a> tag in a nested ul/li structure. I can locate all the <li> elements, but can't get the text inside the <a> tags.
I'm using Python 3.7 and Selenium WebDriver with the Firefox driver.
The corresponding HTML is:
[some HTML]
<ul class="dropdown-menu inner">
<!---->
<li nya-bs-option="curso in ctrl.cursos group by curso.grupo" class="nya-bs-option first-in-group group-item">
<span class="dropdown-header">Cursos em Destaque</span>
<a tabindex="0">Important TEXT 1</a>
</li>
<!-- end nyaBsOption: curso in ctrl.cursos group by curso.grupo -->
<li nya-bs-option="curso in ctrl.cursos group by curso.grupo" class="nya-bs-option group-item">
<span class="dropdown-header">Cursos em Destaque</span>
<a tabindex="0">Important TEXT 2</a>
</li>
<!-- end nyaBsOption: curso in ctrl.cursos group by curso.grupo -->
<li nya-bs-option="curso in ctrl.cursos group by curso.grupo" class="nya-bs-option group-item">
<span class="dropdown-header">Cursos em Destaque</span>
<a tabindex="0">Important TEXT 3</a>
</li>
<!-- end nyaBsOption: curso in ctrl.cursos group by curso.grupo -->
<li nya-bs-option="curso in ctrl.cursos group by curso.grupo" class="nya-bs-option group-item">
<span class="dropdown-header">Cursos em Destaque</span>
<a tabindex="0">Important TEXT4</a>
</li>
[another ~100 similar <li></li> blocks]
<li class="no-search-result" placeholder="Curso">
<span>Unimportant TEXT</span>
</li>
</ul>
[more HTML]
I've tried the code below:
cursos = browser.find_elements_by_xpath('//li[@nya-bs-option="curso in ctrl.cursos group by curso.grupo"]')
nome_curso = [curso.find_element_by_tag_name('a').text for curso in cursos]
I get a list with the correct number of items, but all of them are ''. Can anyone help me? Thanks.
Seems you were close. To extract the texts, e.g. Important TEXT 1, Important TEXT 2, Important TEXT 3, Important TEXT4, etc., you have to induce WebDriverWait for visibility_of_all_elements_located() and you can use either of the following locator strategies:
Using CSS_SELECTOR and get_attribute() method:
print([my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "ul.dropdown-menu.inner li.nya-bs-option a")))])
Using XPATH and text attribute:
print([my_elem.text for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.XPATH, "//ul[@class='dropdown-menu inner']//li[contains(@class, 'nya-bs-option')]//a")))])
Note: You have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
You can find a relevant discussion in How to retrieve the title attribute through Selenium using Python?
Outro
As per the documentation:
get_attribute() method: Gets the given attribute or property of the element.
text attribute: Returns the text of the element.
Difference between text and innerHTML using Selenium
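One concrete difference worth spelling out (this is Selenium behavior, not shown in the snippets above): the text attribute returns the rendered, visible text, so it comes back empty for elements hidden inside a collapsed dropdown, whereas get_attribute("innerHTML") returns the source markup regardless of visibility. That is likely why the original list was full of ''. As a rough offline sketch of the markup-vs-text distinction, using only the stdlib and invented markup:

```python
from html.parser import HTMLParser

# Hypothetical innerHTML of an element; the markup is invented for illustration.
inner_html = 'Important <b>TEXT</b> 1'

class TagStripper(HTMLParser):
    """Collects character data only, discarding all tags."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

stripper = TagStripper()
stripper.feed(inner_html)
print("".join(stripper.parts))  # Important TEXT 1
```

innerHTML keeps the `<b>` tags; the text-like value is the same string with tags stripped.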
I am trying to click a button on a website in Python using Selenium. The html for the button I want looks like this:
<a onclick="exportGridExcel()"><span class="accordion-download"><i class="fa fa-file-excel-o" title="Download Excel" aria-hidden="true"></i></span></a>
A more expanded version of the html is:
<div class="lang-switch-wrapper lang-switch-inverse-wrapper">
<div class="lang-switch">
<ul>
<li class="dropdown">
<a href="#" class="dropdown-toggle lang-lable" data-toggle="dropdown" role="button" aria-haspopup="true" aria-expanded="false">
<span class=" hidden-xs">English</span>
<span class="hidden-lg hidden-md hidden-sm">EN</span>
<img src="/etc/designs/wbrrdesign/clientlibs-wbrredsign/img/angle-down-gray.svg" alt="dropdown"></a>
<ul class="dropdown-menu dropdown-item">
<li>Español</li>
</ul>
</li>
</ul>
</div>
</div>
Then some other stuff before going to the button group and button I want:
<div class="button-group">
<button onclick="onModifyQuery()" type="button" class="btn">Modify Query</button>
<a onclick="exportGridExcel()"><span class="accordion-download"><i class="fa fa-file-excel-o" title="Download Excel" aria-hidden="true"></i></span></a>
<a title="Undo to column removal" onclick="restoreColumn()" class="toggle-btn primary-light-blue-btn"><i class="fa fa-circle-o-notch" aria-hidden="true"></i></a>
</div>
Part of my confusion is inexperience with HTML that uses multiple classes and the <i class=...> element.
EDIT:
If I try something like:
WebDriverWait(driver,300).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "a[onclick='exportGridExcel()']"))).click()
Then I get the error:
ElementClickInterceptedException: element click intercepted: Element <a onclick="exportGridExcel()">...</a> is not clickable at point (772, 11). Other element would receive the click: <div class="lang-switch-wrapper lang-switch-inverse-wrapper">...</div>
The problem is that your page is automatically scrolled up and the excel download button is probably hidden by the top banner that contains the language selector. When Selenium tries to click on the excel download button, it finds the language selector instead.
I would suggest scrolling the page up until the whole data table is visible.
You can use something like this: press the HOME key to go to the top of the page.
from selenium.webdriver.common.keys import Keys
...
...
element = WebDriverWait(driver,300).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "a[onclick='exportGridExcel()']")))
element.send_keys(Keys.CONTROL + Keys.HOME)
element.click()
Ideally, clicking on the element <a onclick="exportGridExcel()"> as per your code trials, inducing WebDriverWait for element_to_be_clickable(), should have worked.
However, there is a descendant <span>, so you can use either of the following locator strategies:
Using CSS_SELECTOR:
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "a[onclick^='exportGridExcel'] > span.accordion-download"))).click()
Using XPATH:
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//a[starts-with(@onclick, 'exportGridExcel')]/span[@class='accordion-download']"))).click()
Note: You have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
In another approach you can also click through the JavaScript executor (execute_script()) as follows:
Using CSS_SELECTOR:
element = WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "a[onclick^='exportGridExcel'] > span.accordion-download")))
driver.execute_script("arguments[0].click();", element)
Using XPATH:
element = WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//a[starts-with(@onclick, 'exportGridExcel')]/span[@class='accordion-download']")))
driver.execute_script("arguments[0].click();", element)
I'm using selenium for web scraping, and I have the following HTML:
<a data-field="---" class="---" target="---" href="https://www.example.com/0000/">
<div class="display-flex align-items-center">
<span class="mr1 hoverable-link-text t-bold">
How can I access the href link and click on it?
I used the following and didn't get results, would be great to understand why:
browser.find_element(By.PARTIAL_LINK_TEXT, "https://www.example.com")
browser.find_element(By.XPATH,"//a[contains(text(),'https://www.example.com')]")
Thanks!
Edit:
The page I'm working on is the LinkedIn interests page (companies that I follow). You can find it on: https://www.linkedin.com/in/yourusername/details/interests/?detailScreenTabIndex=1
For each company I follow, there is an HTML:
<a data-field="active_tab_companies_interests" class="optional-action-target-wrapper
display-flex flex-column full-width" target="_self" href="https://www.linkedin.com/company/1016/">
<div class="display-flex align-items-center">
<span class="mr1 hoverable-link-text t-bold">
<span aria-hidden="true"><!---->GE Healthcare<!----></span><span class="visually-hidden"><!---->GE Healthcare<!----></span>
</span>
<!----><!----><!----> </div>
<!----> <span class="t-14 t-normal t-black--light">
<span aria-hidden="true"><!---->1,851,945 followers<!----></span><span class="visually-hidden"><!---->1,851,945 followers<!----></span>
</span>
<!----> </a>
I want to find href, in my example: "https://www.linkedin.com/company/1016/"
The code I wrote (with the help of the comments):
# log in
browser.get("https://www.linkedin.com")
username = browser.find_element(By.ID,"session_key")
username.send_keys("youremail")
password = browser.find_element(By.ID,"session_password")
password.send_keys("yourpassword")
login_button = browser.find_element(By.CLASS_NAME, "sign-in-form__submit-button")
login_button.click()
# companies I follow on Linkedin
browser.get("https://www.linkedin.com/in/username/details/interests/?detailScreenTabIndex=1")
# find all company links
wait = WebDriverWait(browser, 20)
company_page = browser.find_elements(By.XPATH, "//a[contains(@href,'https://www.linkedin.com/company/')]")
for x in range(len(company_page)):
    print(company_page[x].text)
The output for "GE healthcare" (from the HTML snippet) is:
GE Healthcare
GE Healthcare
1,852,718 followers
1,852,718 followers
and not the href link that I'm looking for. I don't understand why it finds these texts and not the link.
Thanks!
https://www.example.com/0000/ is not text content. It is the value of the href attribute. That is why both of your locators fail.
Please try this:
browser.find_element(By.XPATH, "//a[contains(@href,'https://www.example.com')]")
Adding a .click() will click on that element, as follows:
browser.find_element(By.XPATH, "//a[contains(@href,'https://www.example.com')]").click()
You will probably need to add a wait for the element to be clickable. In this case WebDriverWait with expected conditions is the right way to do it, as follows:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
wait = WebDriverWait(browser, 20)
wait.until(EC.element_to_be_clickable((By.XPATH, "//a[contains(@href,'https://www.example.com')]"))).click()
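To make the attribute-vs-text distinction concrete offline, here is a sketch with Python's stdlib parser, using a simplified version of the question's <a> snippet (attribute values shortened): the URL only ever appears as the href attribute, never as character data, which is why PARTIAL_LINK_TEXT and contains(text(), ...) match nothing. In Selenium, once the element is located, the URL itself is read with element.get_attribute("href").

```python
from html.parser import HTMLParser

# Simplified <a> from the question; attribute values shortened for clarity.
SNIPPET = '<a data-field="x" target="_self" href="https://www.example.com/0000/">link text</a>'

class LinkProbe(HTMLParser):
    """Records each <a>'s href attribute and its text content separately."""
    def __init__(self):
        super().__init__()
        self.hrefs = []
        self.texts = []
        self._in_a = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_a = True
            self.hrefs.append(dict(attrs).get("href"))

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_a = False

    def handle_data(self, data):
        if self._in_a:
            self.texts.append(data)

probe = LinkProbe()
probe.feed(SNIPPET)
print(probe.hrefs)  # ['https://www.example.com/0000/']
print(probe.texts)  # ['link text'] -- the URL never appears as text
```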
I am very new to web scraping. I am working with Selenium and want to extract the text from span tags that sit inside li tags. The span tags have no classes or ids, and I don't know how to reach them. Could you please help me with that?
HTML of the elements:
<div class="cmeStaticMediaBox cmeComponent section">
<div>
<ul class="cmeList">
<li class="cmeListContent cmeContentGroup">
<ul class="cmeHorizontalList cmeListSeparator">
<li>
<!-- Default clicked -->
<span>VOI By Exchange</span>
</li>
<li>
<a href="https://www.cmegroup.com/market-data/volume-open-interest/agriculture-commodities-volume.html" class="none" target="_self">
<span>Agricultural</span></a>
</li>
<li>
<a href="https://www.cmegroup.com/market-data/volume-open-interest/energy-volume.html" class="none" target="_self">
<span>Energy</span></a>
</li>
</ul>
</li>
</ul>
</div>
</div>
The simplest way to do this is:
for e in driver.find_elements(By.CSS_SELECTOR, "ul.cmeHorizontalList a"):
    print(e.text)
Some pitfalls in other answers...
You shouldn't use exceptions to control flow. It's just a bad practice and is slower.
You shouldn't use Copy > XPath from a browser. Most times this generates XPaths that are very brittle. Any XPath that starts at the HTML tag, has more than a few levels, or uses a number of indices (e.g. div[2] and the like) is going to be very brittle. Any even minor change to the page will break that locator.
Prefer CSS selectors over XPath. CSS selectors are better supported, faster, and the syntax is simpler.
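The brittleness point can be illustrated offline with the stdlib's ElementTree, which implements a small XPath subset (the markup below is a made-up miniature, not the real CME page): adding a single wrapper element breaks an absolute path, while a class-anchored search survives.

```python
import xml.etree.ElementTree as ET

# Made-up miniature of the page -- NOT the real CME markup.
page_v1 = "<html><body><div><ul class='cmeHorizontalList'><li><span>Energy</span></li></ul></div></body></html>"
# The same content after a redesign adds one wrapper <div>:
page_v2 = "<html><body><div><div><ul class='cmeHorizontalList'><li><span>Energy</span></li></ul></div></div></body></html>"

def absolute(root):
    # Encodes the full ancestry, like a browser's Copy -> XPath output.
    return root.find("body/div/ul/li/span")

def anchored(root):
    # Anchors on a stable class instead, like a short CSS selector would.
    ul = root.find(".//ul[@class='cmeHorizontalList']")
    return None if ul is None else ul.find(".//span")

r1, r2 = ET.fromstring(page_v1), ET.fromstring(page_v2)
print(absolute(r1).text)  # Energy
print(absolute(r2))       # None -- one extra <div> broke the absolute path
print(anchored(r2).text)  # Energy -- class-anchored search still works
```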
EDIT
Since you need to use Selenium, you can use XPaths to locate elements when you don't have an attribute you can refer to. From your favorite browser press F12, then right-click the element of interest and choose Copy -> XPath. This is the proposed solution (I assume you have Chrome and chromedriver.exe in the same folder as the .py file):
import os
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from selenium import webdriver
url = "https://www.cmegroup.com/market-data/volume-open-interest/metals-volume.html"
i = 1
options = webdriver.ChromeOptions()
# this flag won't open a browser window, if you don't need the dev window uncomment this line
# options.add_argument("--headless")
driver = webdriver.Chrome(
    options=options, executable_path=os.getcwd() + "/chromedriver.exe"
)
driver.get(url)
while True:
    xpath = f"/html/body/div[1]/div[2]/div/div[2]/div[2]/div/ul/li/ul/li[{i}]/a/span"
    try:
        res = driver.find_element(By.XPATH, xpath)
    except NoSuchElementException:
        # There are no more span elements in the li
        break
    print(res.text)
    i += 1
Results:
VOI By Exchange
Agricultural
Energy
Equities
FX
Interest Rates
You can extend this snippet to handle the .csv download from each page.
OLD
If you are working with a static HTML page (like the one you provided in the question) I suggest you use BeautifulSoup. Selenium is better suited when you have to click, fill forms, or otherwise interact with a web page. Here's a snippet with my solution:
from bs4 import BeautifulSoup
html_doc = """
<div class="cmeStaticMediaBox cmeComponent section">
<div>
<ul class="cmeList">
<li class="cmeListContent cmeContentGroup">
<ul class="cmeHorizontalList cmeListSeparator">
<li>
<!-- Default clicked -->
<span>VOI By Exchange</span>
</li>
<li>
<a href="https://www.cmegroup.com/market-data/volume-open-interest/agriculture-commodities-volume.html"
class="none" target="_self">
<span>Agricultural</span></a>
</li>
<li>
<a href="https://www.cmegroup.com/market-data/volume-open-interest/energy-volume.html" class="none"
target="_self">
<span>Energy</span></a>
</li>
</ul>
</li>
</ul>
</div>
</div>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
for span in soup.find_all("span"):
    print(span.text)
And the result will be:
VOI By Exchange
Agricultural
Energy
To extract the desired texts, e.g. VOI By Exchange, Agricultural, Energy, etc., you need to induce WebDriverWait for visibility_of_all_elements_located() and you can use either of the following locator strategies:
Using CSS_SELECTOR:
driver.get('https://www.cmegroup.com/market-data/volume-open-interest/exchange-volume.html')
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button#onetrust-accept-btn-handler"))).click()
print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "ul.cmeHorizontalList.cmeListSeparator li span")))])
Using XPATH:
driver.get('https://www.cmegroup.com/market-data/volume-open-interest/exchange-volume.html')
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//button[@id='onetrust-accept-btn-handler']"))).click()
print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//ul[@class='cmeHorizontalList cmeListSeparator']//li//span")))])
Console Output:
['VOI By Exchange', 'Agricultural', 'Energy', 'Equities', 'FX', 'Interest Rates', 'Metals']
Note: You have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
I'm trying to extract my preferences from my Netflix account with Selenium. Using find_elements_by_class_name I managed to log in, choose a profile, open the account page and switch the list from views to ratings, but I can't figure out how to select the movies from the table, since the aforementioned function doesn't return any results when used on their class or tag names.
This is the code I've written so far, and I've only got problems with the last line:
from getpass import getpass
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys
from selenium import webdriver
ch = Options()
ch.add_argument("--disable-extensions")
ch.add_argument("--disable-gpu")
ch.add_argument("--incognito")
browser = webdriver.Chrome(options = ch)
browser.get("https://www.netflix.com/login")
username = browser.find_element_by_id("id_userLoginId")
password = browser.find_element_by_id("id_password")
username.send_keys(input('Insert e-mail: '))
password.send_keys(getpass(prompt = "Insert password: "))
password.send_keys(Keys.ENTER)
profiles = browser.find_elements_by_class_name("profile-name")
print(profiles)
profiles[0].click()
browser.get("https://www.netflix.com/viewingactivity")
browser.find_element_by_class_name("choice.icon.rating").click()
print(browser.find_elements_by_class_name("retableRow"))
The HTML code I'm referring to is (sorry for the awful formatting):
<ul class="structural retable stdHeight">
<li class="retableRow">
<div class="col date nowrap">05/09/19
</div>
<div class="col title">
Watchmen</div><div class="col rating nowrap"><div class="thumbs-component thumbs thumbs-horizontal rated rated-up" data-uia="thumbs-container">
<div class="nf-svg-button-wrapper thumb-container thumb-up-container " data-uia="">
<a role="link" data-rating="0" tabindex="0" class="nf-svg-button simpleround" aria-label="Già valutato: pollice alzato (fai clic per rimuovere la valutazione)">
<svg data-rating="0" class="svg-icon svg-icon-thumb-up-filled" focusable="true">
<use filter="" xlink:href="#thumb-up-filled"></use></svg></a></div><div class="nf-svg-button-wrapper thumb-container thumb-down-container " data-uia="">
<a role="link" data-rating="1" tabindex="0" class="nf-svg-button simpleround" aria-label="Valutazione pollice verso">
<svg data-rating="1" class="svg-icon svg-icon-thumb-down" focusable="true"><use filter="" xlink:href="#thumb-down">
</use>
</svg>
</a>
</div>
<div class="nf-svg-button-wrapper clear-rating" data-uia="">
<a role="link" data-rating="0" data-clear-thumbs="true" tabindex="0" class="nf-svg-button simpleround" aria-label="Rimuovi la valutazione">
<svg data-rating="0" data-clear-thumbs="true" class="svg-icon svg-icon-close" focusable="true">
<use filter="" xlink:href="#close">
</use>
</svg>
</a>
</div>
</div>
</div>
</li>
It should print a list of all the elements of class "retableRow", but it prints an empty list instead. I've tried with the class "col.title" with similar results, and with the tag "li", which gave me totally different elements I'm not interested in. What am I doing wrong?
You are trying to find elements that are not there yet. The page is probably updated via AJAX calls or something similar.
import time
browser.find_element_by_class_name("choice.icon.rating").click()
time.sleep(1)
print(browser.find_elements_by_class_name("retableRow"))
Ta-daam. Wait for it.
A more elegant approach would be to wait for the element's presence and then start parsing.
Example:
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.common.by import By
def wait_for_elem_by_xpath(xp):
    elem = WebDriverWait(browser, 20).until(EC.presence_of_element_located((By.XPATH, xp)))
    return elem
With this, replace the last line of your sample code with:
your_elem = wait_for_elem_by_xpath('//*[@class="retableRow"]')
print(your_elem)
And it will work. Note that presence_of_element_located returns a single element; to collect every row, use presence_of_all_elements_located instead.
<div id="div_1" class="info-container">
<div class="name" data-toggle="dropdown"><div>
<div class="name" data-toggle="dropdown"><div>
<div id="div_2" class="btn-group user-helper-dropdown">
<i class="material-icons" data-toggle="dropdown" aria-haspopup="true" aria-expanded="true"></i>
<ul id="div_3" class="dropdown-menu pull-right">
<li><a href="/profile/user/view"><i class="material-icons">person</i>Profile</a></li>
<li role="seperator" class="divider"></li>
<li id="sgn_out"><a id="sign_out_button" href="/sign-out"><i class="material-icons">input</i>Sign Out</a></li>
</ul>
</div>
<div>
Hello, I am having trouble finding the XPath for the tag with id='sign_out_button' in Selenium WebDriver in Python.
This is what I have tried:
driver.find_element_by_xpath("//div[@id='div_1']/div[@id='div_2']/ul[@id='div_3']/li[@id='sgn_out']/a")
But it gives me an error like this:
selenium.common.exceptions.NoSuchElementException: Message: no such element:
Unable to locate element:
{"method":"xpath","selector":"//div[@id='div_1']/div[@id='div_2']/ul[@id='div_3']/li[@id='sgn_out']/a"}
As per the HTML you have shared, to invoke click() on the element with id='sign_out_button', as it is a dropdown item you have to induce WebDriverWait for the element to be clickable, as follows:
CSS_SELECTOR:
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "ul.dropdown-menu#div_3 li#sgn_out>a#sign_out_button"))).click()
XPATH:
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//ul[@class='dropdown-menu pull-right' and @id='div_3']//li[@id='sgn_out']/a[@id='sign_out_button']"))).click()