I'm using Selenium/Splinter and trying to get the URL from each element so that I can download PDF statements from Wells Fargo. The statements table contains links to the PDFs; I want to click each link (or otherwise obtain its URL) and download the file to a location on my computer.
import selenium
from splinter import Browser
import time
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By
driver = webdriver.Chrome('actual_path')
driver.get('https://www.wellsfargo.com/')
driver.delete_all_cookies()
mainurl = "https://www.wellsfargo.com/"
# login function - working
username = driver.find_element_by_id("userid")
username.send_keys("actual_username")
passy = driver.find_element_by_id("password")
passy.send_keys("actual_password")
submitbutton = driver.find_element_by_xpath("""//*[@id="frmSignon"]/div[5]""")
driver.find_element_by_xpath('/html/body/div[3]/section/div[1]/div[3]/div[1]/div/div[1]/a[1]').click()
driver.implicitly_wait(sleeptime)
driver.find_element_by_link_text('View Statements').click()
################## NEED HELP -TO SAVE PDF ELEMENTS AND DOWNLOAD #############
elem = driver.find_elements_by_class_name("document-title")
counttotal = 0
for pdf in elem:
    counttotal = counttotal + 1
    elem[counttotal].click()
    driver.back()
When I run for i in elem: print(i), it prints the WebElement objects rather than the URLs. Is there a way to get the link from each element?
# Sample Doc To Click & Download
<div class="documents"><div data-message-container="stmtdiscMessages"><!------------ Error messages -----------------><!----------- Account messages ---------------></div><h3>Statements</h3><p>Deposit account statements are available online for up to 7 years.</p><div class="document large"><div class="document-details account-introtext"> <a role="link" tabindex="0" data-pdf="true" data-url="https://connect.secure.wellsfargo.com/edocs/documents/retrieve/34278aaf-8f37-43de-7d8e-e368124d5f62?_x=gTHPa3PEVAvnSu-uI5vThRyJCGUu-2f4" class="document-title" style="touch-action: auto;">Statement 08/31/19 (21K, PDF)</a></div></div><div class="document large">
#document number 2
<div class="document-details account-introtext"> <a role="link" tabindex="0" data-pdf="true" data-url="https://connect.secure.wellsfargo.com/edocs/documents/retrieve/9efe2b61-8233-8s65-2738-677ef63291f7?_x=h8i20NifIc9dRVCvj9I8pkic0S80i" class="document-title" style="touch-action: auto;">Statement 07/31/19 (21K, PDF)</a></div></div><div class="document large">
#document number 3, etc.
<div class="document-details account-introtext"> <a role="link" tabindex="0" data-pdf="true" data-url="https://connect.secure.wellsfargo.com/edocs/documents/retrieve/7eece2e7-e27e-4445-8s4d-fa5899c5c96b?_x=037X7K-IdhVOVevUISRnQT74qL793tIW" class="document-title" style="touch-action: auto;">Statement 06/30/19 (24K, PDF)</a></div></div><div class="document large">
You can retrieve any attribute from an element with the get_attribute function:
elements = driver.find_elements_by_class_name("document-title")
pdf_urls = []
for element in elements:
    pdf_urls.append(element.get_attribute('data-url'))
Or if you are used to list comprehensions, here's a more pythonic way:
elements = driver.find_elements_by_class_name("document-title")
pdf_urls = [element.get_attribute('data-url') for element in elements]
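Once the URLs are collected, one possible way to save the files is to reuse the browser's login cookies in a plain HTTP request. The sketch below is an illustration under stated assumptions, not part of the original answer: the helper names cookie_header, statement_filename, and download_pdf are hypothetical, and it assumes the server accepts the session cookies alone, which has not been verified for this site.

```python
import re
from urllib.request import Request, urlopen


def cookie_header(selenium_cookies):
    """Turn the list of dicts returned by driver.get_cookies() into a Cookie header value."""
    return "; ".join(f"{c['name']}={c['value']}" for c in selenium_cookies)


def statement_filename(link_text):
    """Derive a file name from link text such as 'Statement 08/31/19 (21K, PDF)'."""
    match = re.search(r"(\d{2})/(\d{2})/(\d{2})", link_text)
    if match is None:
        return "statement.pdf"
    month, day, year = match.groups()
    # Assumption: two-digit years belong to the 2000s.
    return f"statement_20{year}-{month}-{day}.pdf"


def download_pdf(url, selenium_cookies, path):
    """Fetch url with the browser's cookies and write the response body to path."""
    request = Request(url, headers={"Cookie": cookie_header(selenium_cookies)})
    with urlopen(request) as response, open(path, "wb") as out:
        out.write(response.read())
```

With a logged-in driver, each element from the loop above could then be saved with download_pdf(element.get_attribute('data-url'), driver.get_cookies(), statement_filename(element.text)).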
I'm trying to click the expand arrow on each row of the fund list at https://www.xpi.com.br/investimentos/fundos-de-investimento/lista/#/
I'm using the code below:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait as wait
from selenium.webdriver.support import expected_conditions as EC
from time import sleep
url = 'https://www.xpi.com.br/investimentos/fundos-de-investimento/lista/#/'
options = Options()
driver = webdriver.Chrome(options=options)
driver.get(url)
sleep(1)
expandir = driver.find_elements_by_class_name("sly-row")[-4]
expandir.click()
sleep(4)
expandir_fundo = wait(driver, 5).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "div[class='arrow-details']")))
expandir_fundo.click()
And I'm getting the error: TimeoutException: Message
I tried to use the code below too:
expandir_fundo = driver.find_element_by_xpath('//*[@id="investment-funds"]/div/div/div[2]/article/article/section[2]/div[1]/div[10]')
expandir_fundo.click()
and got the error: ElementClickInterceptedException: Message: element click intercepted: Element ... is not clickable at point (1479, 8).
Here is part of the HTML:
<div class="funds-table-row sc-jAaTju kdqiDh sc-ckVGcZ drRPta">
<div class="fund-name sc-jKJlTe hZlCDP" title="Bahia AM Maraú Advisory FIC de FIM" style="cursor:pointer;"><div>Bahia AM Maraú Advisory FIC de F...</div><p class="sc-brqgnP fbcFSC">Multimercado</p></div>
<div class="morningstar sc-jKJlTe hZlCDP">-</div>
<div class="minimal-initial-investment sc-jKJlTe hZlCDP">20.000</div>
<div class="administration-rate sc-jKJlTe hZlCDP">1,90</div>
<div class="redemption-quotation sc-jKJlTe hZlCDP"><div>D+30<p class="sc-brqgnP fbcFSC" style="font-size: 12px; color: rgb(24, 25, 26);">(Dias Corridos)</p></div></div>
<div class="redemption-settlement sc-jKJlTe hZlCDP"><div>D+1<p class="sc-brqgnP fbcFSC" style="font-size: 12px; color: rgb(24, 25, 26);">(Dias Úteis)</p></div></div>
<div class="risk sc-jKJlTe hZlCDP"><span class="badge-suitability color-neutral-dark-pure sc-jWBwVP hvQuvX" style="background-color: rgb(215, 123, 10);">8<span><strong>Perfil Médio</strong><br>A nova pontuação de risco leva em consideração critérios de risco, mercado e liquidez. Para saber mais, clique aqui.</span></span></div>
<div class="profitability sc-jKJlTe hZlCDP"><div class="sc-kEYyzF lnwNVR"></div><div class="sc-kkGfuU jBBLoV"><div class="sc-jKJlTe hZlCDP">0,92</div><div class="sc-jKJlTe hZlCDP">0,48</div><div class="sc-jKJlTe hZlCDP">5,03</div></div></div><div class="invest-button sc-jKJlTe hZlCDP"><button class="xp__button xp__button--small" data-wa="fundos-de-investimento; listagem - investir; investir Bahia AM Maraú Advisory FIC de FIM">Investir</button></div>
<div class="arrow-details sc-jKJlTe hZlCDP"><i type="arrow" data-wa="" class="eab2eu-0 eUjuAo"></i></div></div>
The HTML "arrow" is:
<div class="arrow-details sc-jKJlTe hZlCDP">
<i type="arrow" data-wa="" class="eab2eu-0 eUjuAo">
::before
</i>
</div>
You can try the following code:
option = Options()
#Disable the notification popUp
option.add_argument("--disable-notifications")
driver = webdriver.Chrome(r"ChromeDriverPath",chrome_options=option)
driver.get("https://www.xpi.com.br/investimentos/fundos-de-investimento/lista/#/")
# The cookie button sits inside a shadow DOM, so it is located with execute_script(). The sleep() beforehand avoids an intermittent JS error where the shadow root is still null.
sleep(5)
cookie_btn = driver.execute_script('return document.querySelector("#cookies-policy-container").shadowRoot.querySelector("soma-context > cookies-policy-disclaimer > div > soma-card > div > div:nth-child(2) > soma-button").shadowRoot.querySelector("button")')
cookie_btn.click()
# There are several ways to scroll; this is one of them.
driver.execute_script("window.scrollBy(0,700)")
# Several rows have an expand button, so the XPath is indexed. To click more than one, loop over the index.
expandir_fundo = WebDriverWait(driver, 5).until(EC.element_to_be_clickable((By.XPATH, "((//*[@class='eab2eu-0 eUjuAo'])[1])")))
expandir_fundo.click()
Imports:
from time import sleep
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.chrome.options import Options
While @YaDav MaNish's answer seems to work, I would rather use a hand-written query selector to click the accept-cookies button than one generated by the browser.
cookie_btn = driver.execute_script('return document.querySelector("#cookies-policy-container").shadowRoot.querySelector("soma-context soma-card soma-button").shadowRoot.querySelector("button")')
cookie_btn.click()
should work.
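The nested shadowRoot lookups follow a regular pattern, so the script string can also be assembled from a list of selectors. This is only a sketch: shadow_query_script is a hypothetical helper name, and the selectors are the ones quoted above.

```python
import json


def shadow_query_script(selectors):
    """Build a JS snippet in which each selector after the first is
    resolved inside the shadowRoot of the previous match."""
    steps = [f".querySelector({json.dumps(s)})" for s in selectors]
    return "return document" + ".shadowRoot".join(steps)


script = shadow_query_script([
    "#cookies-policy-container",
    "soma-context soma-card soma-button",
    "button",
])
print(script)
```

The resulting string can be passed to driver.execute_script(script), and the returned element clicked as before.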
Your replies and the cookie code gave me the idea to use execute_script to solve my problem.
My code is below.
In the first part I only open the URL, dismiss the cookie banner, and expand the whole list.
# Setting options
options = Options()
options.add_argument('--window-size=1900,1000')
options.add_argument("--disable-notifications")
# Opening Website
url = 'https://www.xpi.com.br/investimentos/fundos-de-investimento/lista/#/'
driver = webdriver.Chrome(r"chromedriver", options=options)
driver.get(url)
# Removing Cookies
cookie_btn = driver.execute_script('return document.querySelector("#cookies-policy-container").'\
'shadowRoot.querySelector("soma-context > cookies-policy-disclaimer > div > soma-card > div > '\
'div:nth-child(2) > soma-button").shadowRoot.querySelector("button")')
cookie_btn.click()
# Expanding the whole list
expandir = driver.find_elements_by_class_name("sly-row")[-4]
expandir.click()
sleep(4)
# Creating content
page_content = driver.page_source
site = BeautifulSoup(page_content, 'html.parser')
In the second part I use driver.execute_script to open each arrow as I collect the information.
fundos = site.find_all('div',class_='funds-table-row')
for cont, fundo in enumerate(fundos):
    nome = fundo.find('div', class_='fund-name')['title']
    driver.execute_script(f"return document.getElementsByClassName('arrow-details'){[cont + 1]}.click()")
    page_detalhe = driver.page_source
    site2 = BeautifulSoup(page_detalhe, 'html.parser')
    details= site2.find('section', class_='has-documents').find_all('div')
    tax_id = details[2].get_text()
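The {[cont + 1]} placeholder in the loop above works because the f-string formats the one-element Python list as text such as [1], which JavaScript then reads as an index. The sketch below shows that expansion, plus an alternative that passes the index as a script argument; click_arrow is a hypothetical helper name, not part of the original code.

```python
cont = 0
# The f-string renders the Python list [cont + 1] as the text "[1]".
script = f"return document.getElementsByClassName('arrow-details'){[cont + 1]}.click()"
print(script)


def click_arrow(driver, index):
    """Click the arrow at the given index, passing the index as arguments[0]."""
    driver.execute_script(
        "document.getElementsByClassName('arrow-details')[arguments[0]].click()",
        index,
    )
```

Passing the index as an argument keeps the JavaScript string constant and avoids relying on how Python prints a list.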
Thanks, everyone, for the help.
I am trying to scrape data from this page: https://www.oddsportal.com/soccer/chile/primera-division/curico-unido-o-higgins-CtsLggl6/#over-under;2
Here I am trying to expand all of the "compare odds" fields, which are contained in this HTML:
<div class="table-container">
<div class="table-header-light even"><strong>Over/Under +1 </strong><span class="avg chunk-odd-payout">93.4%</span><span class="avg chunk-odd nowrp">5.63</span>
<span
class="avg chunk-odd nowrp">1.12</span><span class="odds-cnt">(3)</span><span class="odds-co"><a class="more" href="" onclick="page.togleTableContent('P-1.00-0-0',this);return false;">Compare odds</a></span></div>
</div>
<div class="table-container" style="display: none;">
<div class="table-header-light"><strong>Over/Under +1.25 </strong><span class="avg chunk-odd-payout"></span><span class="avg chunk-odd nowrp"></span><span class="avg chunk-odd nowrp"></span>
<span
class="odds-cnt">(0)</span><span class="odds-co"><a class="more" href="" onclick="page.togleTableContent('P-1.25-0-0',this);return false;">Compare odds</a></span></div>
</div>
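The part I am trying to access is the following:
<span class="odds-co"><a class="more" href="" onclick="page.togleTableContent('P-1.00-0-0',this);return false;">Compare odds</a></span>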
I have tried all of the following:
#odds_rows = browser.find_elements_by_class_name('more')
# odds_rows=browser.find_elements_by_css_selector(".more")
# odds_rows=WebDriverWait(browser, 10).until(EC.visibility_of_all_elements_located((By.XPATH, "//*[@class='more']")))
odds_rows= WebDriverWait(browser, 10).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, ".more")))
#odds_rows=WebDriverWait(browser, 10).until(EC.visibility_of_all_elements_located((By.CLASS_NAME, "more")))
I then want to loop over the identified fields and click each one:
for i in odds_rows:
    #browser.execute_script("arguments[0].click();", i)
    i.click()
However, already at the step of identifying the fields, I get a timeout error on every WebDriverWait attempt except for
odds_rows=WebDriverWait(browser, 10).until(EC.visibility_of_all_elements_located((By.CLASS_NAME, "more")))
This option yields only one result:
[<selenium.webdriver.remote.webelement.WebElement (session="7cbc57173a57aadbc115264dff8ca620", element="3654f928-bca4-4033-9566-da9e6aa6294b")>]
However this result is not clicked subsequently.
What am I doing wrong?
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
driver = webdriver.Chrome()
driver.implicitly_wait(10)
driver.get("https://www.oddsportal.com/soccer/chile/primera-division/curico-unido-o-higgins-CtsLggl6/#over-under;2;0.50;0")
driver.maximize_window()
odds_rows = WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, '.table-header-light')))
for i in odds_rows:
    count = i.find_element_by_xpath('./span[@class="odds-cnt"]')
    elem = i.find_elements_by_xpath('.//*[contains(text(),"Compare")]')
    txt = count.text
    # Rows with an empty count are hidden and cannot be clicked, so skip them.
    if txt != '' and len(elem):
        elem = elem[0]
        driver.execute_script("arguments[0].scrollIntoView();", elem)
        elem.click()
The issue is that a row whose count is empty is not visible, so you cannot click it.
If you click on 'Compare odds', you can see the URL change from
https://www.oddsportal.com/soccer/chile/primera-division/curico-unido-o-higgins-CtsLggl6/#over-under;2
to
https://www.oddsportal.com/soccer/chile/primera-division/curico-unido-o-higgins-CtsLggl6/#over-under;2;0.50;0
If you keep clicking, you will see that the middle value of the last part (2;0.50;0) increases by 0.5 each time. The next URLs are
https://www.oddsportal.com/soccer/chile/primera-division/curico-unido-o-higgins-CtsLggl6/#over-under;2;1.00;0
https://www.oddsportal.com/soccer/chile/primera-division/curico-unido-o-higgins-CtsLggl6/#over-under;2;1.50;0
and so on.
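The URL pattern described above can be generated rather than clicked through. This is a minimal sketch: handicap_urls is a hypothetical helper, the count of 3 is arbitrary, and it assumes the site only needs 0.5 steps, although the HTML in the question also shows a +1.25 row.

```python
BASE = ("https://www.oddsportal.com/soccer/chile/primera-division/"
        "curico-unido-o-higgins-CtsLggl6/#over-under;2")


def handicap_urls(count, start=0.5, step=0.5):
    """Return `count` URLs whose middle fragment value grows by `step`."""
    return [f"{BASE};{start + i * step:.2f};0" for i in range(count)]


for url in handicap_urls(3):
    print(url)
```

Each generated URL could then be loaded with driver.get(url) in place of a click.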
Alternatively, the data is already in the elements with class "table-main detail-odds sortable", which are hidden by default. You DON'T need to click; you can scrape that class directly.
I hope this helps.
I have been trying for hours to sort this out, without success.
Here is my script using Selenium WebDriver in Python to extract the title, date, and link. I am able to extract the title and link, but I am stuck on extracting the date. Could someone please help me with this? I would much appreciate a response.
import selenium.webdriver
import pandas as pd
frame=[]
url = "https://www.oric.gov.au/publications/media-releases"
driver = selenium.webdriver.Chrome("C:/Users/[Computer_Name]/Downloads/chromedriver.exe")
driver.get(url)
all_div = driver.find_elements_by_xpath('//div[contains(@class, "ui-accordion-content")]')
for div in all_div:
    all_items = div.find_elements_by_tag_name("a")
    for item in all_items:
        title = item.get_attribute('textContent')
        link = item.get_attribute('href')
        date = None  # TODO: this is the part I am stuck on
        frame.append({
            'title': title,
            'date': date,
            'link': link,
        })
dfs = pd.DataFrame(frame)
dfs.to_csv('myscraper.csv',index=False,encoding='utf-8-sig')
Here is the html I am interested in:
<div id="ui-accordion-1-panel-0" ...>
<div class="views-field views-field-title">
<span class="field-content">
<a href="/publications/media-release/ngadju-corporation-emerges-special-administration-stronger">
Ngadju corporation emerges from special administration stronger
</a>
</span>
</div>
<div class="views-field views-field-field-document-media-release-no">
<div class="field-content"><span class="date-display-single" property="dc:date" datatype="xsd:dateTime" content="2020-07-31T00:00:00+10:00">
31 July 2020
</span> (MR2021-06)</div>
</div>
</div>
...
I'd get all rows first.
from pprint import pprint
import selenium.webdriver
frame = []
url = "https://www.oric.gov.au/publications/media-releases"
driver = selenium.webdriver.Chrome()
driver.get(url)
divs = driver.find_elements_by_css_selector('div.ui-accordion-content')
for div in divs:
    rows = div.find_elements_by_css_selector('div.views-row')
    for row in rows:
        item = row.find_element_by_tag_name('a')
        title = item.get_attribute('textContent')
        link = item.get_attribute('href')
        date = row.find_element_by_css_selector(
            'span.date-display-single').get_attribute('textContent')
        frame.append({
            'title': title,
            'date': date,
            'link': link,
        })
driver.quit()
pprint(frame)
print(len(frame))
Just search for the <span> with the property dc:date, store it in a WebElement such as dateElement, and read dateElement.text. That gives you the date as a string.
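If a real date object is preferred over a string, either the span's content attribute or its visible text can be parsed with the standard library. The values below are copied from the HTML in the question; reading them with Selenium (for example via get_attribute('content') on the span) is left out so the sketch runs without a browser.

```python
from datetime import datetime

# Values copied from the HTML shown in the question.
content_attr = "2020-07-31T00:00:00+10:00"
visible_text = "\n31 July 2020\n"

# The content attribute is ISO 8601, so fromisoformat can parse it directly.
from_attr = datetime.fromisoformat(content_attr).date()
# The visible text needs whitespace stripped and an explicit format.
from_text = datetime.strptime(visible_text.strip(), "%d %B %Y").date()

print(from_attr.isoformat())
print(from_text.isoformat())
```

The content attribute is the safer source of the two, since parsing month names with %B depends on the active locale.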
How can I read the text of the <dd> tag that follows a given <dt>, such as Commodity Code?
<dl class="dl">
<dt>Trading Screen Product Name</dt>
<dd>Biodiesel Futures (balmo)</dd>
<dt>Trading Screen Hub Name</dt>
<dd>Soybean Oil Pen 1st Line</dd>
<dt>Commodity Code</dt>
<dd><div>S25-S2Z</div></dd>
<dt>Contract Size</dt>
<dd><div>100 metric tonnes (220,462 pounds)</div></dd>
</dl>
from selenium import webdriver
driver = webdriver.Chrome("C:\\Python36-32\\selenium\\webdriver\\chromedriver.exe")
link_list = ["http://www.theice.com/products/31500922","http://www.theice.com/products/243"]
driver.maximize_window()
for link in link_list:
    driver.get(link)
    desc_list = driver.find_elements_by_class_name("dl")
Try to implement below code to get the values of "Commodity Code" as output:
for desc_list in driver.find_elements_by_class_name("dl"):
    print(desc_list.find_element_by_xpath("./dt[.='Commodity Code']/following-sibling::dd").text)
To read the text of the <dd> tags that have a child <div>, e.g. S25-S2Z, you can collect the matching elements into a list and print the text of each one:
for element in driver.find_elements_by_xpath("//dl[@class='dl']//dd/div"):
    print(element.get_attribute("innerHTML"))
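If every label/value pair is wanted rather than a single one, the page source can also be parsed offline. The sketch below uses only the standard library on a shortened copy of the HTML from the question; DefinitionListParser is a hypothetical class name, and in practice the input would be driver.page_source.

```python
from html.parser import HTMLParser


class DefinitionListParser(HTMLParser):
    """Collect <dt>/<dd> text pairs into a dict, in document order."""

    def __init__(self):
        super().__init__()
        self.pairs = {}
        self._tag = None      # "dt" or "dd" while inside one, else None
        self._label = None    # text of the most recent <dt>
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in ("dt", "dd"):
            self._tag = tag
            self._buffer = []

    def handle_data(self, data):
        if self._tag is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return  # ignore nested tags such as the <div> inside <dd>
        text = "".join(self._buffer).strip()
        if tag == "dt":
            self._label = text
        elif self._label is not None:
            self.pairs[self._label] = text
        self._tag = None


html = """<dl class="dl">
<dt>Commodity Code</dt>
<dd><div>S25-S2Z</div></dd>
<dt>Contract Size</dt>
<dd><div>100 metric tonnes (220,462 pounds)</div></dd>
</dl>"""

parser = DefinitionListParser()
parser.feed(html)
print(parser.pairs["Commodity Code"])
```

A dict keyed by the <dt> text avoids depending on the position of each row, which may differ between product pages.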
I am trying to get names and affiliations of authors from a series of articles from this page (you'll need to have access to Proquest to visualise it). What I want to do is to open all the tooltips present at the top of the page, and extract some HTML text from them. This is my code:
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
browser = webdriver.Firefox()
url = 'http://search.proquest.com/econlit/docview/56607849/citation/2876523144F544E0PQ/3?accountid=13042'
browser.get(url)
#insert your username and password here
n_authors = browser.find_elements_by_class_name('zoom') #zoom is the class name of the three tooltips that I want to open in my loop
author = []
institution = []
for a in n_authors:
    print(a)
    ActionChains(browser).move_to_element(a).click().perform()
    html_author = browser.find_element_by_xpath('//*[@id="authorResolveLinks"]/li/div/a').get_attribute('innerHTML')
    html_institution = browser.find_element_by_xpath('//*[@id="authorResolveLinks"]/li/div/p').get_attribute('innerHTML')
    author.append(html_author)
    institution.append(html_institution)
Although n_authors has three entries that are apparently different from one another, selenium fails to get the info from all tooltips, instead returning this:
author
#['Nuttall, William J.',
#'Nuttall, William J.',
#'Nuttall, William J.']
And the same happens for the institution. What am I getting wrong? Thanks a lot
EDIT:
The array containing the xpaths of the tooltips:
n_authors
#[<selenium.webdriver.remote.webelement.WebElement (session="277c8abc-3883-
#43a8-9e93-235a8ded80ff", element="{008a2ade-fc82-4114-b1bf-cc014d41c40f}")>,
#<selenium.webdriver.remote.webelement.WebElement (session="277c8abc-3883-
#43a8-9e93-235a8ded80ff", element="{c4c2d89f-3b8a-42cc-8570-735a4bd56c07}")>,
#<selenium.webdriver.remote.webelement.WebElement (session="277c8abc-3883-
#43a8-9e93-235a8ded80ff", element="{9d06cb60-df58-4f90-ad6a-43afeed49a87}")>]
Which has length 3, and the three elements are different, which is why I don't understand why selenium won't distinguish them.
EDIT 2:
Here is the relevant HTML
<span class="titleAuthorETC small">
<span style="display:none" class="title">false</span>
Jamasb, Tooraj
<a class="zoom" onclick="return false;" href="#">
<img style="margin-left:4px; border:none" alt="Visualizza profilo" id="resolverCitation_previewTrigger_0" title="Visualizza profilo" src="/assets/r20161.1.0-4/ctx/images/scholarUniverse/ar_button.gif">
</a><script type="text/javascript">Tips.images = '/assets/r20161.1.0-4/pqc/javascript/prototip/images/prototip/';</script>; Nuttall, William J
<a class="zoom" onclick="return false;" href="#">
<img style="margin-left:4px; border:none" alt="Visualizza profilo" id="resolverCitation_previewTrigger_1" title="Visualizza profilo" src="/assets/r20161.1.0-4/ctx/images/scholarUniverse/ar_button.gif">
</a>; Pollitt, Michael G
<a class="zoom" onclick="return false;" href="#">
<img style="margin-left:4px; border:none" alt="Visualizza profilo" id="resolverCitation_previewTrigger_2" title="Visualizza profilo" src="/assets/r20161.1.0-4/ctx/images/scholarUniverse/ar_button.gif">
</a>.
UPDATE:
@parishodak's answer, for some reason, does not work in Firefox unless I manually hover over the tooltips first. It works with chromedriver, but only if I first hover over the tooltips and only if I allow a time.sleep(), as in
import itertools
import time
from selenium.common.exceptions import NoSuchElementException
for i in itertools.count():
    try:
        tooltip = browser.find_element_by_xpath('//*[@id="resolverCitation_previewTrigger_' + str(i) + '"]')
        print(tooltip)
        ActionChains(browser).move_to_element(tooltip).perform()  # hover over the tooltip trigger
    except NoSuchElementException:
        break
time.sleep(2)
elements = browser.find_elements_by_xpath('//*[@id="authorResolveLinks"]/li/div/a')
author = []
for e in elements:
    print(e)
    attribute = e.get_attribute('innerHTML')
    author.append(attribute)
It returns the same element because the XPath does not change between loop iterations, and find_element_by_xpath always returns the first match.
There are two ways to deal with this:
Index the XPath match explicitly, as below:
browser.find_element_by_xpath('(//*[@id="authorResolveLinks"]/li/div/a)[1]').get_attribute('innerHTML')
browser.find_element_by_xpath('(//*[@id="authorResolveLinks"]/li/div/a)[2]').get_attribute('innerHTML')
browser.find_element_by_xpath('(//*[@id="authorResolveLinks"]/li/div/a)[3]').get_attribute('innerHTML')
Or
Use find_elements_by_xpath instead of find_element_by_xpath:
elements = browser.find_elements_by_xpath('//*[@id="authorResolveLinks"]/li/div/a')
Then loop over elements and call get_attribute('innerHTML') on each one.
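The second option can be wrapped in a small helper. This is a sketch only: collect_inner_html is a hypothetical name, it uses the generic find_elements(by, value) call with the "xpath" strategy so that it does not depend on the find_elements_by_xpath shortcut, and it assumes all tooltips are present in the DOM at the time of the call, which the question's update suggests may require hovering first.

```python
def collect_inner_html(browser, xpath):
    """Return the innerHTML of every element matching xpath, in document order."""
    # "xpath" is the locator strategy string that By.XPATH stands for.
    elements = browser.find_elements("xpath", xpath)
    return [element.get_attribute("innerHTML") for element in elements]
```

For the page in the question this would be called as author = collect_inner_html(browser, '//*[@id="authorResolveLinks"]/li/div/a'), and likewise with /p at the end for the institutions.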