Read Description list in selenium python - python

How can I read the text of the <dd> tag that follows a given <dt>, such as Commodity Code?
<dl class="dl">
<dt>Trading Screen Product Name</dt>
<dd>Biodiesel Futures (balmo)</dd>
<dt>Trading Screen Hub Name</dt>
<dd>Soybean Oil Pen 1st Line</dd>
<dt>Commodity Code</dt>
<dd><div>S25-S2Z</div></dd>
<dt>Contract Size</dt>
<dd><div>100 metric tonnes (220,462 pounds)</div></dd>
</dl>
from selenium import webdriver
driver = webdriver.Chrome("C:\\Python36-32\\selenium\\webdriver\\chromedriver.exe")
link_list = ["http://www.theice.com/products/31500922","http://www.theice.com/products/243"]
driver.maximize_window()
for link in link_list:
    driver.get(link)
    desc_list = driver.find_elements_by_class_name("dl")

Try the following code to print the value of "Commodity Code":
for desc_list in driver.find_elements_by_class_name("dl"):
    print(desc_list.find_element_by_xpath("./dt[.='Commodity Code']/following-sibling::dd").text)
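The dt/dd pairing that the following-sibling::dd XPath relies on can be checked offline. The sketch below is not Selenium code: it assumes the <dl> markup from the question parses as well-formed XML with the standard library's xml.etree.ElementTree, and the helper name description_list_to_dict is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Markup copied from the question
SNIPPET = """<dl class="dl">
<dt>Trading Screen Product Name</dt>
<dd>Biodiesel Futures (balmo)</dd>
<dt>Trading Screen Hub Name</dt>
<dd>Soybean Oil Pen 1st Line</dd>
<dt>Commodity Code</dt>
<dd><div>S25-S2Z</div></dd>
<dt>Contract Size</dt>
<dd><div>100 metric tonnes (220,462 pounds)</div></dd>
</dl>"""


def description_list_to_dict(dl_markup):
    """Pair each <dt> with the <dd> that follows it (hypothetical helper)."""
    root = ET.fromstring(dl_markup)
    pairs = {}
    current_term = None
    for child in root:
        # itertext() also collects text nested in a child <div>
        text = "".join(child.itertext()).strip()
        if child.tag == "dt":
            current_term = text
        elif child.tag == "dd" and current_term is not None:
            pairs[current_term] = text
    return pairs


details = description_list_to_dict(SNIPPET)
print(details["Commodity Code"])
```

Building a dictionary once is convenient when several fields (for example both Commodity Code and Contract Size) are needed from the same list.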

To read the text of a <dd> tag that wraps a child <div>, e.g. S25-S2Z, collect the matching elements and print the contents of each one:
for element in driver.find_elements_by_xpath("//dl[@class='dl']//dd/div"):
    print(element.get_attribute("innerHTML"))

Related

Get values from CSS span element with constantly changing values

I am trying to scrape a website that seems to use different values each time a particular span element appears. For example, the first few times the span element appears, it could be:
<span title="PM XX">PM XX</span>
<span title="Star Charterist">Star Charterist</span>
<span title="Elephant Trainer">Elephant Trainer</span>
I have tried the following, but I keep getting empty lists:
site = BeautifulSoup(link.text, "html.parser")
jobs_a = site.find_all("span title")
or
jobs_a = site.find_all("span", attrs="title")
or
jobs_a = site.find_all("span", attrs="title*")
Any suggestions?
I prefer using a CSS selector.
from bs4 import BeautifulSoup
data = '''\
<span title="PM XX">PM XX</span>
<span title="Star Charterist">Star Charterist</span>
<span title="Elephant Trainer">Elephant Trainer</span>
'''
soup = BeautifulSoup(data, 'html.parser')
for s in soup.select('span[title]'):
    print(f"{s.text=}\t{s.attrs['title']=}")
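The attempts in the question return empty lists because find_all treats "span title" as a tag name and attrs="title" as a class filter; with Beautiful Soup, find_all("span", attrs={"title": True}) matches any span that has a title attribute. For comparison, here is a minimal dependency-free sketch of the same "every span with a title" selection using the standard library's html.parser. The class name TitledSpanCollector is made up for illustration.

```python
from html.parser import HTMLParser


class TitledSpanCollector(HTMLParser):
    """Collect the title attribute of every <span> that has one."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            attr_map = dict(attrs)
            if "title" in attr_map:
                self.titles.append(attr_map["title"])


data = '''\
<span title="PM XX">PM XX</span>
<span title="Star Charterist">Star Charterist</span>
<span title="Elephant Trainer">Elephant Trainer</span>
'''

collector = TitledSpanCollector()
collector.feed(data)
print(collector.titles)
```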

Selenium & Python: Finding elements with dynamic XPATH

I am trying to extract the URL from a very specific href. The site has many HTML paths that are very similar, and the only way I have found to reach this URL is an XPath built the way I am doing it.
The big issue is that the label changes all the time: part of it is static, but the rest is dynamic and more or less random.
The html looks like this:
NOTE: page_name ="Laura" is a name I can select
# Option 1
<span label="answer by Laura to Charles">
# Option 2
<span label="answer by Laura to Nina">
# Option 3
<span label="answer by Laura to Maria">
<div >
<a href="www.thisisawebsite.otherthings.blabla...>
# Option n
<span label="answer by Laura to THIS COULD BE ANY RANDOM NAME">
<div >
<a href="www.thisisawebsite.otherthings.blabla...>
I have tried different options:
get_comment = WebDriverWait(self.driver, 2).until(
EC.presence_of_all_elements_located((
By.XPATH,
r'//span[contains(text(), "answer by {}")]/div/a'.format(page_name)))
)[0].get_attribute('href')
Other try:
get_comment = WebDriverWait(self.driver, 2).until(
EC.presence_of_all_elements_located((
By.XPATH,
r'//span[(@label="answer by {}")]/div/a'.format(page_name)))
)[0].get_attribute('href')
The second attempt should work if you change it to use contains():
get_comment = WebDriverWait(self.driver, 2).until(
EC.presence_of_all_elements_located((
By.XPATH,
r'//span[contains(@label,"answer by {}")]/div/a'.format(page_name)))
)[0].get_attribute('href')
With '=', XPath matches only when the attribute equals the string exactly. contains() matches when the attribute includes the string, so the static part of the label is enough.
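The difference between exact and partial matching can be illustrated without a browser. In this sketch the label strings come from the question, plain Python comparisons stand in for the two XPath predicates, and the helper name answer_link_xpath is an assumption made for illustration.

```python
def answer_link_xpath(page_name):
    """Build the XPath from the answer: match on the static part of @label."""
    return '//span[contains(@label,"answer by {}")]/div/a'.format(page_name)


# Labels taken from the question
labels = [
    "answer by Laura to Charles",
    "answer by Laura to Nina",
    "answer by Laura to Maria",
]
static_part = "answer by {}".format("Laura")

# Mirrors @label="answer by Laura": the whole attribute must be equal
exact_matches = [label for label in labels if label == static_part]
# Mirrors contains(@label, "answer by Laura"): a substring is enough
partial_matches = [label for label in labels if static_part in label]

print(answer_link_xpath("Laura"))
print(len(exact_matches), len(partial_matches))
```

Because contains() is a substring test, "answer by Laura" would also match a longer name that begins with "Laura"; appending " to " to the static part is one way to tighten the match.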

Python/Selenium- Class element cannot be clicked

I am trying to scrape data from this page: https://www.oddsportal.com/soccer/chile/primera-division/curico-unido-o-higgins-CtsLggl6/#over-under;2
Here I am trying to expand all of the "compare odds" fields, which are contained in this HTML:
<div class="table-container">
<div class="table-header-light even"><strong>Over/Under +1 </strong><span class="avg chunk-odd-payout">93.4%</span><span class="avg chunk-odd nowrp">5.63</span>
<span
class="avg chunk-odd nowrp">1.12</span><span class="odds-cnt">(3)</span><span class="odds-co"><a class="more" href="" onclick="page.togleTableContent('P-1.00-0-0',this);return false;">Compare odds</a></span></div>
</div>
<div class="table-container" style="display: none;">
<div class="table-header-light"><strong>Over/Under +1.25 </strong><span class="avg chunk-odd-payout"></span><span class="avg chunk-odd nowrp"></span><span class="avg chunk-odd nowrp"></span>
<span
class="odds-cnt">(0)</span><span class="odds-co"><a class="more" href="" onclick="page.togleTableContent('P-1.25-0-0',this);return false;">Compare odds</a></span></div>
</div>
The part I am trying to access is the following:
span class="odds-co">Compare odds
I have tried all of the following:
#odds_rows = browser.find_elements_by_class_name('more')
# odds_rows=browser.find_elements_by_css_selector(".more")
# odds_rows=WebDriverWait(browser, 10).until(EC.visibility_of_all_elements_located((By.XPATH, "//*[@class='more']")))
odds_rows= WebDriverWait(browser, 10).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, ".more")))
#odds_rows=WebDriverWait(browser, 10).until(EC.visibility_of_all_elements_located((By.CLASS_NAME, "more")))
In order to subsequently loop click through the identified fields:
for i in odds_rows:
    #browser.execute_script("arguments[0].click();", i)
    i.click()
However, already when identifying the fields I get a timeout error on all WebDriverWait attempts except for
odds_rows=WebDriverWait(browser, 10).until(EC.visibility_of_all_elements_located((By.CLASS_NAME, "more")))
This option yields only one result:
[<selenium.webdriver.remote.webelement.WebElement (session="7cbc57173a57aadbc115264dff8ca620", element="3654f928-bca4-4033-9566-da9e6aa6294b")>]
However this result is not clicked subsequently.
What am I doing wrong?
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
driver = webdriver.Chrome()
driver.implicitly_wait(10)
driver.get("https://www.oddsportal.com/soccer/chile/primera-division/curico-unido-o-higgins-CtsLggl6/#over-under;2;0.50;0")
driver.maximize_window()
odds_rows = WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, '.table-header-light')))
for i in odds_rows:
    count = i.find_element_by_xpath('./span[@class="odds-cnt"]')
    elem = i.find_elements_by_xpath('.//*[contains(text(),"Compare")]')
    txt = count.text
    # rows with an empty count are hidden and cannot be clicked
    if txt != '' and len(elem):
        elem = elem[0]
        driver.execute_script("arguments[0].scrollIntoView();", elem)
        elem.click()
The issue is that rows with an empty count ('') are not visible, so you cannot click them.
If you click on 'Compare odds', you can see the URL change from
https://www.oddsportal.com/soccer/chile/primera-division/curico-unido-o-higgins-CtsLggl6/#over-under;2
to
https://www.oddsportal.com/soccer/chile/primera-division/curico-unido-o-higgins-CtsLggl6/#over-under;2;0.50;0
If you keep clicking, you will see that the value in the last part, 2;0.50;0, increases by 0.5 each time. The next URLs are
https://www.oddsportal.com/soccer/chile/primera-division/curico-unido-o-higgins-CtsLggl6/#over-under;2;1.00;0
https://www.oddsportal.com/soccer/chile/primera-division/curico-unido-o-higgins-CtsLggl6/#over-under;2;1.50;0
and so on.
Alternatively, the data is in the elements with class "table-main detail-odds sortable", which are hidden by default. You don't need to click; you can scrape that class directly.
I hope this helps.
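The URL pattern described in this answer can be generated rather than clicked through. This is a minimal sketch, assuming the fragment keeps the form over-under;2;<value>;0 with the value rising in steps of 0.5 as observed above; the function name over_under_urls and its parameters are made up for illustration, and the site may not follow this pattern for every match.

```python
BASE = ("https://www.oddsportal.com/soccer/chile/primera-division/"
        "curico-unido-o-higgins-CtsLggl6/")


def over_under_urls(base, count, first=0.5, step=0.5):
    """Return `count` URLs whose fragment value starts at `first` and grows by `step`."""
    urls = []
    for i in range(count):
        value = first + i * step
        # Two decimal places, matching fragments such as 0.50 and 1.00
        urls.append("{}#over-under;2;{:.2f};0".format(base, value))
    return urls


for url in over_under_urls(BASE, 3):
    print(url)
```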

Selenium Web Driver: How do I get a url from an element?

I am using Selenium/Splinter and trying to get the URL from each element in order to download PDF statements from Wells Fargo. Scraping the table gives me the PDF link elements; I want to click each link and then download the file to a location on my computer.
import selenium
from splinter import Browser
import time
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By
driver = webdriver.Chrome('actual_path')
driver.get('https://www.wellsfargo.com/')
driver.delete_all_cookies()
mainurl = "https://www.wellsfargo.com/"
# login function - working
username = driver.find_element_by_id("userid")
username.send_keys("actual_username")
passy = driver.find_element_by_id("password")
passy.send_keys("actual_password")
submitbutton = driver.find_element_by_xpath("""//*[@id="frmSignon"]/div[5]""")
driver.find_element_by_xpath('/html/body/div[3]/section/div[1]/div[3]/div[1]/div/div[1]/a[1]').click()
driver.implicitly_wait(sleeptime)
driver.find_element_by_link_text('View Statements').click()
################## NEED HELP -TO SAVE PDF ELEMENTS AND DOWNLOAD #############
elem = driver.find_elements_by_class_name("document-title")
counttotal = 0
for pdf in elem:
    counttotal = counttotal + 1
    elem[counttotal].click()
    driver.back()
When I run for i in elem: print(i), it prints the element objects but not the URL. Is there any way to get the link from each element?
# Sample Doc To Click & Download
<div class="documents"><div data-message-container="stmtdiscMessages"><!------------ Error messages -----------------><!----------- Account messages ---------------></div><h3>Statements</h3><p>Deposit account statements are available online for up to 7 years.</p><div class="document large"><div class="document-details account-introtext"> <a role="link" tabindex="0" data-pdf="true" data-url="https://connect.secure.wellsfargo.com/edocs/documents/retrieve/34278aaf-8f37-43de-7d8e-e368124d5f62?_x=gTHPa3PEVAvnSu-uI5vThRyJCGUu-2f4" class="document-title" style="touch-action: auto;">Statement 08/31/19 (21K, PDF)</a></div></div><div class="document large">
#document number 2
<div class="document-details account-introtext"> <a role="link" tabindex="0" data-pdf="true" data-url="https://connect.secure.wellsfargo.com/edocs/documents/retrieve/9efe2b61-8233-8s65-2738-677ef63291f7?_x=h8i20NifIc9dRVCvj9I8pkic0S80i" class="document-title" style="touch-action: auto;">Statement 07/31/19 (21K, PDF)</a></div></div><div class="document large">
#document number 3, etc.
<div class="document-details account-introtext"> <a role="link" tabindex="0" data-pdf="true" data-url="https://connect.secure.wellsfargo.com/edocs/documents/retrieve/7eece2e7-e27e-4445-8s4d-fa5899c5c96b?_x=037X7K-IdhVOVevUISRnQT74qL793tIW" class="document-title" style="touch-action: auto;">Statement 06/30/19 (24K, PDF)</a></div></div><div class="document large">
You can retrieve any attribute from an element with the get_attribute function:
elements = driver.find_elements_by_class_name("document-title")
pdf_urls = []
for element in elements:
    pdf_urls.append(element.get_attribute('data-url'))
Or if you are used to list comprehensions, here's a more pythonic way:
elements = driver.find_elements_by_class_name("document-title")
pdf_urls = [element.get_attribute('data-url') for element in elements]
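The find_elements_by_* helpers used here were removed in later Selenium 4 releases, where the same lookup is written as find_elements(By.CLASS_NAME, ...). The sketch below assumes that call style and passes the plain string "class name", which is the value of By.CLASS_NAME. FakeDriver and FakeElement are hypothetical stand-ins so the example runs without a browser, and the example.invalid URLs are placeholders.

```python
def collect_data_urls(driver):
    """Return the data-url attribute of every element with class 'document-title'."""
    # "class name" is the string value of selenium's By.CLASS_NAME
    elements = driver.find_elements("class name", "document-title")
    urls = []
    for element in elements:
        url = element.get_attribute("data-url")
        if url:  # skip elements that lack the attribute
            urls.append(url)
    return urls


class FakeElement:
    """Hypothetical stand-in for a WebElement."""

    def __init__(self, attributes):
        self._attributes = attributes

    def get_attribute(self, name):
        return self._attributes.get(name)


class FakeDriver:
    """Hypothetical stand-in for a WebDriver; ignores the locator."""

    def __init__(self, elements):
        self._elements = elements

    def find_elements(self, by, value):
        return list(self._elements)


fake = FakeDriver([
    FakeElement({"data-url": "https://example.invalid/statement-1.pdf"}),
    FakeElement({}),
    FakeElement({"data-url": "https://example.invalid/statement-2.pdf"}),
])
pdf_urls = collect_data_urls(fake)
print(pdf_urls)
```

With a real session, collect_data_urls(driver) would be called after navigating to the statements page.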

Looping over multiple tooltips

I am trying to get names and affiliations of authors from a series of articles from this page (you'll need to have access to Proquest to visualise it). What I want to do is to open all the tooltips present at the top of the page, and extract some HTML text from them. This is my code:
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
browser = webdriver.Firefox()
url = 'http://search.proquest.com/econlit/docview/56607849/citation/2876523144F544E0PQ/3?accountid=13042'
browser.get(url)
#insert your username and password here
n_authors = browser.find_elements_by_class_name('zoom') #zoom is the class name of the three tooltips that I want to open in my loop
author = []
institution = []
for a in n_authors:
    print(a)
    ActionChains(browser).move_to_element(a).click().perform()
    html_author = browser.find_element_by_xpath('//*[@id="authorResolveLinks"]/li/div/a').get_attribute('innerHTML')
    html_institution = browser.find_element_by_xpath('//*[@id="authorResolveLinks"]/li/div/p').get_attribute('innerHTML')
    author.append(html_author)
    institution.append(html_institution)
Although n_authors has three entries that are apparently different from one another, selenium fails to get the info from all tooltips, instead returning this:
author
#['Nuttall, William J.',
#'Nuttall, William J.',
#'Nuttall, William J.']
And the same happens for the institution. What am I getting wrong? Thanks a lot
EDIT:
The array containing the xpaths of the tooltips:
n_authors
#[<selenium.webdriver.remote.webelement.WebElement (session="277c8abc-3883-
#43a8-9e93-235a8ded80ff", element="{008a2ade-fc82-4114-b1bf-cc014d41c40f}")>,
#<selenium.webdriver.remote.webelement.WebElement (session="277c8abc-3883-
#43a8-9e93-235a8ded80ff", element="{c4c2d89f-3b8a-42cc-8570-735a4bd56c07}")>,
#<selenium.webdriver.remote.webelement.WebElement (session="277c8abc-3883-
#43a8-9e93-235a8ded80ff", element="{9d06cb60-df58-4f90-ad6a-43afeed49a87}")>]
Which has length 3, and the three elements are different, which is why I don't understand why selenium won't distinguish them.
EDIT 2:
Here is the relevant HTML
<span class="titleAuthorETC small">
<span style="display:none" class="title">false</span>
Jamasb, Tooraj
<a class="zoom" onclick="return false;" href="#">
<img style="margin-left:4px; border:none" alt="Visualizza profilo" id="resolverCitation_previewTrigger_0" title="Visualizza profilo" src="/assets/r20161.1.0-4/ctx/images/scholarUniverse/ar_button.gif">
</a><script type="text/javascript">Tips.images = '/assets/r20161.1.0-4/pqc/javascript/prototip/images/prototip/';</script>; Nuttall, William J
<a class="zoom" onclick="return false;" href="#">
<img style="margin-left:4px; border:none" alt="Visualizza profilo" id="resolverCitation_previewTrigger_1" title="Visualizza profilo" src="/assets/r20161.1.0-4/ctx/images/scholarUniverse/ar_button.gif">
</a>; Pollitt, Michael G
<a class="zoom" onclick="return false;" href="#">
<img style="margin-left:4px; border:none" alt="Visualizza profilo" id="resolverCitation_previewTrigger_2" title="Visualizza profilo" src="/assets/r20161.1.0-4/ctx/images/scholarUniverse/ar_button.gif">
</a>.
UPDATE:
@parishodak's answer does not work with Firefox for some reason, unless I manually hover over the tooltips first. It works with chromedriver, but only if I first hover over the tooltips and allow a time.sleep(), as in
import itertools
import time
from selenium.common.exceptions import NoSuchElementException
for i in itertools.count():
    try:
        tooltip = browser.find_element_by_xpath('//*[@id="resolverCitation_previewTrigger_' + str(i) + '"]')
        print(tooltip)
        ActionChains(browser).move_to_element(tooltip).perform()
    except NoSuchElementException:
        break
    time.sleep(2)
elements = browser.find_elements_by_xpath('//*[@id="authorResolveLinks"]/li/div/a')
author = []
for e in elements:
    print(e)
    attribute = e.get_attribute('innerHTML')
    author.append(attribute)
It returns the same element because the XPath does not change between loop iterations, and find_element_by_xpath always returns the first match.
There are two ways to deal with this:
Index into the XPath matches, as shown below:
browser.find_element_by_xpath('(//*[@id="authorResolveLinks"]/li/div/a)[1]').get_attribute('innerHTML')
browser.find_element_by_xpath('(//*[@id="authorResolveLinks"]/li/div/a)[2]').get_attribute('innerHTML')
browser.find_element_by_xpath('(//*[@id="authorResolveLinks"]/li/div/a)[3]').get_attribute('innerHTML')
Or
Use find_elements_by_xpath instead of find_element_by_xpath:
elements = browser.find_elements_by_xpath('//*[@id="authorResolveLinks"]/li/div/a')
Then loop over elements and call get_attribute('innerHTML') on each one.
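The singular-versus-plural behaviour can be reproduced offline. In this sketch the standard library's ElementTree find and findall stand in for Selenium's find_element and find_elements. The markup is hypothetical, since the tooltip HTML is not shown in the question: it assumes every author ends up under the same authorResolveLinks container, and the institution names are placeholders.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: structure assumed from the XPath in the question
MARKUP = """<div>
<ul id="authorResolveLinks">
<li><div><a>Jamasb, Tooraj</a><p>Placeholder Institution A</p></div></li>
<li><div><a>Nuttall, William J.</a><p>Placeholder Institution B</p></div></li>
<li><div><a>Pollitt, Michael G</a><p>Placeholder Institution C</p></div></li>
</ul>
</div>"""

root = ET.fromstring(MARKUP)
author_path = ".//*[@id='authorResolveLinks']/li/div/a"
institution_path = ".//*[@id='authorResolveLinks']/li/div/p"

# find() returns only the first match, however often it is called
repeated = [root.find(author_path).text for _ in range(3)]

# findall() returns every match, so one pass collects all authors
authors = [a.text for a in root.findall(author_path)]
institutions = [p.text for p in root.findall(institution_path)]

print(repeated)
print(list(zip(authors, institutions)))
```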
