I am very new to web scraping. I am working with Selenium and want to extract the text from some span tags. The tags have no classes or ids, and the span tags sit inside li tags. I don't know how to do that. Could you please help me?
HTML of the elements:
<div class="cmeStaticMediaBox cmeComponent section">
<div>
<ul class="cmeList">
<li class="cmeListContent cmeContentGroup">
<ul class="cmeHorizontalList cmeListSeparator">
<li>
<!-- Default clicked -->
<span>VOI By Exchange</span>
</li>
<li>
<a href="https://www.cmegroup.com/market-data/volume-open-interest/agriculture-commodities-volume.html" class="none" target="_self">
<span>Agricultural</span></a>
</li>
<li>
<a href="https://www.cmegroup.com/market-data/volume-open-interest/energy-volume.html" class="none" target="_self">
<span>Energy</span></a>
</li>
</ul>
</li>
</ul>
</div>
</div>
The simplest way to do this is:
for e in driver.find_elements(By.CSS_SELECTOR, "ul.cmeHorizontalList a"):
    print(e.text)
(Use the selector "ul.cmeHorizontalList li span" instead if you also want the first entry, VOI By Exchange, which is not wrapped in an <a>.)
Some pitfalls in other answers...
You shouldn't use exceptions to control flow. It's bad practice and slower than asking for all matches up front: find_elements() returns an empty list when nothing matches, so you can simply iterate it.
You shouldn't use Copy > XPath from a browser. Most times this generates XPaths that are very brittle. Any XPath that starts at the html tag, has more than a few levels, or uses a number of indices (e.g. div[2] and the like) is going to be very brittle; even a minor change to the page will break that locator.
Prefer CSS selectors over XPath. CSS selectors are better supported, faster, and the syntax is simpler.
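To make the list-based, exception-free approach concrete: find_elements() hands you a plain list to iterate, and an empty list simply produces no iterations. The same pattern can be sketched without a browser using only the standard library on the question's HTML (this is an illustration of the idea, not the Selenium API):

```python
from html.parser import HTMLParser

class SpanCollector(HTMLParser):
    """Collect the text of every <span> inside the target <ul>."""
    def __init__(self):
        super().__init__()
        self.in_list = False   # inside <ul class="cmeHorizontalList ...">
        self.in_span = False   # inside a <span> within that <ul>
        self.texts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "ul" and "cmeHorizontalList" in attrs.get("class", ""):
            self.in_list = True
        elif tag == "span" and self.in_list:
            self.in_span = True

    def handle_endtag(self, tag):
        if tag == "ul":
            self.in_list = False
        elif tag == "span":
            self.in_span = False

    def handle_data(self, data):
        if self.in_span and data.strip():
            self.texts.append(data.strip())

html = """
<ul class="cmeHorizontalList cmeListSeparator">
  <li><span>VOI By Exchange</span></li>
  <li><a href="#"><span>Agricultural</span></a></li>
  <li><a href="#"><span>Energy</span></a></li>
</ul>
"""
parser = SpanCollector()
parser.feed(html)
# Iterating the collected list needs no exception handling -- an empty
# list just yields zero iterations, exactly like find_elements().
for text in parser.texts:
    print(text)
```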
EDIT
Since you need to use Selenium, you can use XPaths to locate elements when you don't have an id or class to refer to. In your browser press F12, then right-click the element of interest and choose Copy > XPath. This is the proposed solution (I assume you have Chrome and chromedriver in the same folder as the .py file):
import os
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from selenium import webdriver

url = "https://www.cmegroup.com/market-data/volume-open-interest/metals-volume.html"
i = 1
options = webdriver.ChromeOptions()
# headless mode won't open a browser window; uncomment this line if you don't need to watch the run
# options.add_argument("--headless")
driver = webdriver.Chrome(
    options=options, executable_path=os.getcwd() + "/chromedriver.exe"
)
driver.get(url)
while True:
    xpath = f"/html/body/div[1]/div[2]/div/div[2]/div[2]/div/ul/li/ul/li[{i}]/a/span"
    try:
        res = driver.find_element(By.XPATH, xpath)
    except NoSuchElementException:
        # there are no more span elements in the list
        break
    print(res.text)
    i += 1
Results:
VOI By Exchange
Agricultural
Energy
Equities
FX
Interest Rates
You can extend this snippet to handle the .csv download from each page.
OLD
If you are working with a static HTML page (like the one you provided in the question) I suggest you use BeautifulSoup. Selenium is better suited when you have to click, fill forms, or otherwise interact with a web page. Here's a snippet with my solution:
from bs4 import BeautifulSoup
html_doc = """
<div class="cmeStaticMediaBox cmeComponent section">
<div>
<ul class="cmeList">
<li class="cmeListContent cmeContentGroup">
<ul class="cmeHorizontalList cmeListSeparator">
<li>
<!-- Default clicked -->
<span>VOI By Exchange</span>
</li>
<li>
<a href="https://www.cmegroup.com/market-data/volume-open-interest/agriculture-commodities-volume.html"
class="none" target="_self">
<span>Agricultural</span></a>
</li>
<li>
<a href="https://www.cmegroup.com/market-data/volume-open-interest/energy-volume.html" class="none"
target="_self">
<span>Energy</span></a>
</li>
</ul>
</li>
</ul>
</div>
</div>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
for span in soup.find_all("span"):
    print(span.text)
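One caveat: find_all("span") grabs every span on the page. If the real page contained other spans, BeautifulSoup's select() accepts the same scoped CSS selector used with Selenium above (a sketch, assuming bs4 is installed; the HTML is shortened from the question):

```python
from bs4 import BeautifulSoup

html_doc = """
<ul class="cmeHorizontalList cmeListSeparator">
  <li><span>VOI By Exchange</span></li>
  <li><a href="#"><span>Agricultural</span></a></li>
  <li><a href="#"><span>Energy</span></a></li>
</ul>
"""

soup = BeautifulSoup(html_doc, "html.parser")
# select() scopes the search to spans inside the target list only
texts = [span.get_text(strip=True)
         for span in soup.select("ul.cmeHorizontalList li span")]
print(texts)
```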
And the result will be:
VOI By Exchange
Agricultural
Energy
To extract the desired texts e.g. VOI By Exchange, Agricultural, Energy, etc you need to induce WebDriverWait for visibility_of_all_elements_located() and you can use either of the following Locator Strategies:
Using CSS_SELECTOR:
driver.get('https://www.cmegroup.com/market-data/volume-open-interest/exchange-volume.html')
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button#onetrust-accept-btn-handler"))).click()
print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "ul.cmeHorizontalList.cmeListSeparator li span")))])
Using XPATH:
driver.get('https://www.cmegroup.com/market-data/volume-open-interest/exchange-volume.html')
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//button[@id='onetrust-accept-btn-handler']"))).click()
print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//ul[@class='cmeHorizontalList cmeListSeparator']//li//span")))])
Console Output:
['VOI By Exchange', 'Agricultural', 'Energy', 'Equities', 'FX', 'Interest Rates', 'Metals']
Note: you have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
Related
I am trying to click a button on a website in Python using Selenium. The html for the button I want looks like this:
<a onclick="exportGridExcel()"><span class="accordion-download"><i class="fa fa-file-excel-o" title="Download Excel" aria-hidden="true"></i></span></a>
A more expanded version of the html is:
<div class="lang-switch-wrapper lang-switch-inverse-wrapper">
<div class="lang-switch">
<ul>
<li class="dropdown">
<a href="#" class="dropdown-toggle lang-lable" data-toggle="dropdown" role="button" aria-haspopup="true" aria-expanded="false">
<span class=" hidden-xs">English</span>
<span class="hidden-lg hidden-md hidden-sm">EN</span>
<img src="/etc/designs/wbrrdesign/clientlibs-wbrredsign/img/angle-down-gray.svg" alt="dropdown"></a>
<ul class="dropdown-menu dropdown-item">
<li>Español</li>
</ul>
</li>
</ul>
</div>
</div>
Then some other stuff before going to the button group and button I want:
<div class="button-group">
<button onclick="onModifyQuery()" type="button" class="btn">Modify Query</button>
<a onclick="exportGridExcel()"><span class="accordion-download"><i class="fa fa-file-excel-o" title="Download Excel" aria-hidden="true"></i></span></a>
<a title="Undo to column removal" onclick="restoreColumn()" class="toggle-btn primary-light-blue-btn"><i class="fa fa-circle-o-notch" aria-hidden="true"></i></a>
</div>
Part of my confusion is my inexperience with HTML that has multiple classes and the <i> tag.
EDIT:
If I try something like:
WebDriverWait(driver,300).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "a[onclick='exportGridExcel()']"))).click()
Then I get the error:
ElementClickInterceptedException: element click intercepted: Element <a onclick="exportGridExcel()">...</a> is not clickable at point (772, 11). Other element would receive the click: <div class="lang-switch-wrapper lang-switch-inverse-wrapper">...</div>
The problem is that your page is automatically scrolled up and the excel download button is probably hidden by the top banner that contains the language selector. When Selenium tries to click on the excel download button, it finds the language selector instead.
I would suggest you scroll the page up until the whole data table is visible.
You can use something like this, pressing the HOME key to go to the top of the page:
from selenium.webdriver.common.keys import Keys
...
...
element = WebDriverWait(driver,300).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "a[onclick='exportGridExcel()']")))
element.send_keys(Keys.CONTROL + Keys.HOME)
element.click()
Ideally, clicking on the element <a onclick="exportGridExcel()"> as in your code trials, inducing WebDriverWait for element_to_be_clickable(), should have worked.
However, there is a descendant <span>, so you can use either of the following locator strategies:
Using CSS_SELECTOR:
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "a[onclick^='exportGridExcel'] > span.accordion-download"))).click()
Using XPATH:
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//a[starts-with(@onclick, 'exportGridExcel')]/span[@class='accordion-download']"))).click()
Note: you have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
In another approach you can also use the JavaScript executor (execute_script) as follows:
Using CSS_SELECTOR:
element = WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "a[onclick^='exportGridExcel'] > span.accordion-download")))
driver.execute_script("arguments[0].click();", element)
Using XPATH:
element = WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//a[starts-with(@onclick, 'exportGridExcel')]/span[@class='accordion-download']")))
driver.execute_script("arguments[0].click();", element)
I'm using selenium for web scraping, and I have the following HTML:
<a data-field="---" class="---" target="---" href="https://www.example.com/0000/">
<div class="display-flex align-items-center">
<span class="mr1 hoverable-link-text t-bold">
How can I access the href link and click on it?
I used the following and didn't get results, would be great to understand why:
browser.find_element(By.PARTIAL_LINK_TEXT, "https://www.example.com")
browser.find_element(By.XPATH,"//a[contains(text(),'https://www.example.com')]")
Thanks!
Edit:
The page I'm working on is the LinkedIn interests page (companies that I follow). You can find it on: https://www.linkedin.com/in/yourusername/details/interests/?detailScreenTabIndex=1
For each company I follow, there is an HTML:
<a data-field="active_tab_companies_interests" class="optional-action-target-wrapper
display-flex flex-column full-width" target="_self" href="https://www.linkedin.com/company/1016/">
<div class="display-flex align-items-center">
<span class="mr1 hoverable-link-text t-bold">
<span aria-hidden="true"><!---->GE Healthcare<!----></span><span class="visually-hidden"><!---->GE Healthcare<!----></span>
</span>
<!----><!----><!----> </div>
<!----> <span class="t-14 t-normal t-black--light">
<span aria-hidden="true"><!---->1,851,945 followers<!----></span><span class="visually-hidden"><!---->1,851,945 followers<!----></span>
</span>
<!----> </a>
I want to find href, in my example: "https://www.linkedin.com/company/1016/"
The code I wrote (with the help of the comments):
# log in
browser.get("https://www.linkedin.com")
username = browser.find_element(By.ID,"session_key")
username.send_keys("youremail")
password = browser.find_element(By.ID,"session_password")
password.send_keys("yourpassword")
login_button = browser.find_element(By.CLASS_NAME, "sign-in-form__submit-button")
login_button.click()
# companies I follow on Linkedin
browser.get("https://www.linkedin.com/in/username/details/interests/?detailScreenTabIndex=1")
# find all company links
wait = WebDriverWait(browser, 20)
company_page = browser.find_elements(By.XPATH, "//a[contains(@href,'https://www.linkedin.com/company/')]")
for x in range(len(company_page)):
    print(company_page[x].text)
The output for "GE healthcare" (from the HTML snippet) is:
GE Healthcare
GE Healthcare
1,852,718 followers
1,852,718 followers
and not the href link that I'm looking for. I don't understand why it finds these texts and not the link.
Thanks!
https://www.example.com/0000/ is not text content; it is the value of the href attribute. That is why both of your locators fail.
Please try this:
browser.find_element(By.XPATH, "//a[contains(@href,'https://www.example.com')]")
Adding a .click() will click on that element, as follows:
browser.find_element(By.XPATH, "//a[contains(@href,'https://www.example.com')]").click()
You will probably need to wait for the element to be clickable; WebDriverWait with an expected condition is the right way to do that, as follows:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
wait = WebDriverWait(browser, 20)
wait.until(EC.element_to_be_clickable((By.XPATH, "//a[contains(@href,'https://www.example.com')]"))).click()
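A side note for reading the link rather than clicking it (which is what the question ultimately asks for): once the element is located, the URL comes from element.get_attribute("href"), not from .text, which only returns rendered text. The distinction can be illustrated without a browser (stdlib only; the anchor below is a shortened, illustrative version of the question's HTML):

```python
from html.parser import HTMLParser

class HrefCollector(HTMLParser):
    """Collect href values of anchors pointing at example.com."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href", "")
            # same test as the XPath contains(@href, ...) above
            if "https://www.example.com" in href:
                self.hrefs.append(href)

html = '<a href="https://www.example.com/0000/"><span>some link text</span></a>'
parser = HrefCollector()
parser.feed(html)
# .text would yield "some link text"; the link itself lives in the attribute
print(parser.hrefs)
```

With Selenium the equivalent is [a.get_attribute("href") for a in company_page] over the located elements.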
I am trying to click multiple divs that share the same class, but nothing I've tried has worked.
Seat plan image: https://ibb.co/SfdT2LW
This is the code:
<a href="javascript:void(0)" class="seat" title="[GHA-15]" data-toggle="tooltip" data-placement="bottom" onclick="chooseSeat(this)">0-4-5
<div class="spinner">
<div class="double-bounce1"></div>
<div class="double-bounce2"></div>
</div>
</a>
<a href="javascript:void(0)" class="seat" title="[GHA-14]" data-toggle="tooltip" data-placement="bottom" onclick="chooseSeat(this)">0-2-5
<div class="spinner">
<div class="double-bounce1"></div>
<div class="double-bounce2"></div>
</div>
</a>
That's how I tried it, but it works for a single div only:
div = driver.find_element_by_class_name("spinner")
div.click()
This is what I tried from the web, but it isn't helping:
div1 = driver.find_elements_by_xpath('//a[@class="seat"]//preceding-sibling::td[@div="spinner"]')
# div1.click()
To differentiate between 0-4-5 and 0-2-5 you can simply use the title attribute in an XPath against the shared HTML:
//a[@title='[GHA-15]']
represents 0-4-5, and
//a[@title='[GHA-14]']
represents 0-2-5.
Click it like:
driver.find_element(By.XPATH, "//a[@title='[GHA-15]']").click()
or
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//a[@title='[GHA-14]']"))).click()
Update:
There are multiple ways to click on the spinner element.
Use XPath-indexing:
driver.find_element(By.XPATH, "(//div[@class='spinner'])[1]").click()
or
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "(//div[@class='spinner'])[1]"))).click()
Use find_elements
elements = WebDriverWait(driver, 20).until(EC.presence_of_all_elements_located((By.XPATH, "//div[@class='spinner']")))
elements[0].click()
or
elements[1].click()
etc.
click in a loop:
elements = WebDriverWait(driver, 20).until(EC.presence_of_all_elements_located((By.XPATH, "//div[@class='spinner']")))
for element in elements:
    element.click()
    time.sleep(3)
Imports:
import time
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
In the following way you can select the required div when multiple divs have the same class name:
select = browser.find_element_by_xpath('//*[@class="class name"][index of div]')
Here "class name" is the name of the div's class and the index is the position of the div you want to select, starting from 1.
Get all elements with the same class using:
elems = driver.find_elements(By.CLASS_NAME, "seat")
then iterate through the list to find what you are looking for:
for elem in elems:
    if ...:  # your condition
        ...  # do your stuff
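The iterate-and-filter idea can be sanity-checked without a browser. This sketch (plain Python on a shortened copy of the question's HTML, not the Selenium API) collects each seat anchor's title attribute and then filters for the seat to click:

```python
from html.parser import HTMLParser

class SeatCollector(HTMLParser):
    """Collect the title attribute of every <a class="seat">."""
    def __init__(self):
        super().__init__()
        self.titles = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "seat" in attrs.get("class", "").split():
            self.titles.append(attrs.get("title"))

html = """
<a href="javascript:void(0)" class="seat" title="[GHA-15]">0-4-5</a>
<a href="javascript:void(0)" class="seat" title="[GHA-14]">0-2-5</a>
"""
parser = SeatCollector()
parser.feed(html)
# pick the seat whose title matches the one we want to click
wanted = [t for t in parser.titles if t == "[GHA-15]"]
print(parser.titles, wanted)
```

In Selenium the equivalent condition would compare elem.get_attribute("title") inside the loop above.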
I am trying to access a div when all the divs have the same name. Let me explain: I am just starting out with Selenium and Python, and I am trying to scrape a webpage to learn. I ran into the following problem.
I made the example HTML below to show the structure of the webpage. All the divs have exactly the same class and title. Inside each there is an h1 tag for the item and a p tag for the colour (which is a clickable link).
I am trying to search the page given certain instructions. Example: I am looking for a white racebike. I am able to find the bikes with the first line of code, but how do I find the right colour within the racebike section? If I run the Python mentioned below I get an error message. Thanks in advance!
<!DOCTYPE html>
<html>
<body>
<div class=div title=example>
<h1>racebike</h1>
<p class='test'>black</p>
</div>
<div class=div title=example>
<h1>racebike</h1>
<p class='test'>white</p>
</div>
<div class=div title=example>
<h1>racebike</h1>
<p class='test'>yellow</p>
</div>
<div class=div title=example>
<h1>citybike</h1>
<p class='test'>yellow</p>
</div>
<div class=div title=example>
<h1>citybike</h1>
<p class='test'>green</p>
</div>
</body>
</html>
test = (self.driver.find_element_by_xpath("//*[contains(text(), racebike)]"))
test.self.driver.find_element_by_xpath(".//*[contains(text(), white)]").click
To locate/click() on the white racebike element you need to induce WebDriverWait for the element_to_be_clickable() and you can use either of the following xpath based Locator Strategies:
Using XPATH:
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//h1[text()='racebike']//following-sibling::p[@class='test' and text()='white']"))).click()
Using XPATH considering the parent <div>:
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//div[@class='div' and @title='example']/h1[text()='racebike']//following-sibling::p[@class='test' and text()='white']"))).click()
Note: you have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
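The sibling relationship those XPaths rely on (a p colour following an h1 name inside the same div) can be checked without a browser. This stdlib-only sketch re-implements the pairing logic on a shortened, normalized copy of the question's HTML; it is an illustration of the locator's logic, not Selenium code:

```python
from html.parser import HTMLParser

class BikeCollector(HTMLParser):
    """Pair each <h1> bike name with the <p> colour that follows it."""
    def __init__(self):
        super().__init__()
        self.current = None    # tag whose text we are currently reading
        self.last_name = None  # most recent <h1> text
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "p"):
            self.current = tag

    def handle_endtag(self, tag):
        if tag in ("h1", "p"):
            self.current = None

    def handle_data(self, data):
        data = data.strip()
        if not data:
            return
        if self.current == "h1":
            self.last_name = data
        elif self.current == "p":
            self.pairs.append((self.last_name, data))

html = """
<div class="div" title="example"><h1>racebike</h1><p class="test">black</p></div>
<div class="div" title="example"><h1>racebike</h1><p class="test">white</p></div>
<div class="div" title="example"><h1>citybike</h1><p class="test">yellow</p></div>
"""
parser = BikeCollector()
parser.feed(html)
# the same (name, colour) pair the XPath targets
match = [pair for pair in parser.pairs if pair == ("racebike", "white")]
print(match)
```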
You can use the same XPath you tried in your solution. It's possible the server is taking too long to respond.
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
element = WebDriverWait(page, 10).until(EC.presence_of_element_located((By.XPATH, "//p[contains(text(), 'white')]")))
element.click()
For multiple bikes with white colour:
elements = WebDriverWait(driver, 30).until(EC.presence_of_all_elements_located((By.XPATH, "//p[contains(text(), 'white')]")))
for element in elements:
    element.click()
I'm trying to get the text inside an <a> tag in a nested ul-li structure. I can locate all the <li> elements, but I can't get the text inside the <a>s.
I'm using Python 3.7 and Selenium WebDriver with the Firefox driver.
The corresponding HTML is:
[some HTML]
<ul class="dropdown-menu inner">
<!---->
<li nya-bs-option="curso in ctrl.cursos group by curso.grupo" class="nya-bs-option first-in-group group-item">
<span class="dropdown-header">Cursos em Destaque</span>
<a tabindex="0">Important TEXT 1</a>
</li>
<!-- end nyaBsOption: curso in ctrl.cursos group by curso.grupo -->
<li nya-bs-option="curso in ctrl.cursos group by curso.grupo" class="nya-bs-option group-item">
<span class="dropdown-header">Cursos em Destaque</span>
<a tabindex="0">Important TEXT 2</a>
</li>
<!-- end nyaBsOption: curso in ctrl.cursos group by curso.grupo -->
<li nya-bs-option="curso in ctrl.cursos group by curso.grupo" class="nya-bs-option group-item">
<span class="dropdown-header">Cursos em Destaque</span>
<a tabindex="0">Important TEXT 3</a>
</li>
<!-- end nyaBsOption: curso in ctrl.cursos group by curso.grupo -->
<li nya-bs-option="curso in ctrl.cursos group by curso.grupo" class="nya-bs-option group-item">
<span class="dropdown-header">Cursos em Destaque</span>
<a tabindex="0">Important TEXT4</a>
</li>
[another 100 similar <li></li> blocks]
<li class="no-search-result" placeholder="Curso">
<span>Unimportant TEXT</span>
</li>
</ul>
[more HTML]
I've tried the code below:
cursos = browser.find_elements_by_xpath('//li[@nya-bs-option="curso in ctrl.cursos group by curso.grupo"]')
nome_curso = [curso.find_element_by_tag_name('a').text for curso in cursos]
I get a list with the correct number of items, but every entry is ''. Can anyone help me? Thanks.
Seems you were close. To extract the texts, e.g. Important TEXT 1, Important TEXT 2, Important TEXT 3, Important TEXT4, etc you have to induce WebDriverWait for the desired visibility_of_all_elements_located() and you can use either of the following Locator Strategies:
Using CSS_SELECTOR and get_attribute() method:
print([my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "ul.dropdown-menu.inner li.nya-bs-option a")))])
Using XPATH and text attribute:
print([my_elem.text for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.XPATH, "//ul[@class='dropdown-menu inner']//li[contains(@class, 'nya-bs-option')]//a")))])
Note: you have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
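A likely reason the original .text came back empty: Selenium's .text only returns text the browser currently renders, and options in a closed dropdown render nothing, while innerHTML is returned regardless of visibility. The locator logic itself can be verified without a browser; this stdlib-only sketch re-implements it on a shortened copy of the question's HTML:

```python
from html.parser import HTMLParser

class OptionCollector(HTMLParser):
    """Collect <a> text inside <li class="... nya-bs-option ...">."""
    def __init__(self):
        super().__init__()
        self.in_option = False  # inside a matching <li>
        self.in_a = False       # inside an <a> within that <li>
        self.texts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and "nya-bs-option" in attrs.get("class", ""):
            self.in_option = True
        elif tag == "a" and self.in_option:
            self.in_a = True

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_option = False
        elif tag == "a":
            self.in_a = False

    def handle_data(self, data):
        if self.in_a and data.strip():
            self.texts.append(data.strip())

html = """
<ul class="dropdown-menu inner">
  <li class="nya-bs-option group-item"><a tabindex="0">Important TEXT 1</a></li>
  <li class="nya-bs-option group-item"><a tabindex="0">Important TEXT 2</a></li>
  <li class="no-search-result"><span>Unimportant TEXT</span></li>
</ul>
"""
parser = OptionCollector()
parser.feed(html)
print(parser.texts)
```

The no-search-result item is skipped, just as the class-based locators above skip it.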
You can find a relevant discussion in How to retrieve the title attribute through Selenium using Python?
Outro
As per the documentation:
The get_attribute() method gets the given attribute or property of the element.
The text attribute returns the text of the element.
Difference between text and innerHTML using Selenium