I'm trying to expand the arrow shown below on https://www.xpi.com.br/investimentos/fundos-de-investimento/lista/#/
I'm using the code below:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait as wait
from selenium.webdriver.support import expected_conditions as EC
from time import sleep
url = 'https://www.xpi.com.br/investimentos/fundos-de-investimento/lista/#/'
options = Options()
driver = webdriver.Chrome(options=options)
driver.get(url)
sleep(1)
expandir = driver.find_elements_by_class_name("sly-row")[-4]
expandir.click()
sleep(4)
expandir_fundo = wait(driver, 5).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "div[class='arrow-details']")))
expandir_fundo.click()
And I'm getting the error: TimeoutException: Message
I tried to use the code below too:
expandir_fundo = driver.find_element_by_xpath('//*[@id="investment-funds"]/div/div/div[2]/article/article/section[2]/div[1]/div[10]')
expandir_fundo.click()
and got the error: ElementClickInterceptedException: Message: element click intercepted: Element ... is not clickable at point (1479, 8).
Find below part of the HTML:
<div class="funds-table-row sc-jAaTju kdqiDh sc-ckVGcZ drRPta">
<div class="fund-name sc-jKJlTe hZlCDP" title="Bahia AM Maraú Advisory FIC de FIM" style="cursor:pointer;"><div>Bahia AM Maraú Advisory FIC de F...</div><p class="sc-brqgnPfbcFSC">Multimercado</p></div>
<div class="morningstar sc-jKJlTe hZlCDP">-</div>
<div class="minimal-initial-investment sc-jKJlTe hZlCDP">20.000</div>
<div class="administration-rate sc-jKJlTe hZlCDP">1,90</div>
<div class="redemption-quotation sc-jKJlTe hZlCDP"><div>D+30<p class="sc-brqgnP fbcFSC" style="font-size: 12px; color: rgb(24, 25, 26);">(Dias Corridos)</p></div></div>
<div class="redemption-settlement sc-jKJlTe hZlCDP"><div>D+1<p class="sc-brqgnP fbcFSC" style="font-size: 12px; color: rgb(24, 25, 26);">(Dias Úteis)</p></div></div>
<div class="risk sc-jKJlTe hZlCDP"><span class="badge-suitability color-neutral-dark-pure sc-jWBwVP hvQuvX" style="background-color: rgb(215, 123, 10);">8<span><strong>Perfil Médio</strong><br>A nova pontuação de risco leva em consideração critérios de risco, mercado e liquidez. Para saber mais, clique aqui.</span></span></div>
<div class="profitability sc-jKJlTe hZlCDP"><div class="sc-kEYyzF lnwNVR"></div><div class="sc-kkGfuU jBBLoV"><div class="sc-jKJlTe hZlCDP">0,92</div><div class="sc-jKJlTe hZlCDP">0,48</div><div class="sc-jKJlTe hZlCDP">5,03</div></div></div><div class="invest-button sc-jKJlTe hZlCDP"><button class="xp__button xp__button--small" data-wa="fundos-de-investimento; listagem - investir; investir Bahia AM Maraú Advisory FIC de FIM">Investir</button></div>
<div class="arrow-details sc-jKJlTe hZlCDP"><i type="arrow" data-wa="" class="eab2eu-0 eUjuAo"></i></div></div>
The HTML "arrow" is:
<div class="arrow-details sc-jKJlTe hZlCDP">
<i type="arrow" data-wa="" class="eab2eu-0 eUjuAo">
::before
</i>
</div>
You can check the below lines of code:
option = Options()
#Disable the notification popUp
option.add_argument("--disable-notifications")
driver = webdriver.Chrome(r"ChromeDriverPath",chrome_options=option)
driver.get("https://www.xpi.com.br/investimentos/fundos-de-investimento/lista/#/")
#Clicked on the cookie button, which is under a shadow DOM, so used execute_script(). Also used sleep() before clicking on the cookie because at some point it throws a JS error that the shadow DOM is null.
sleep(5)
cookie_btn = driver.execute_script('return document.querySelector("#cookies-policy-container").shadowRoot.querySelector("soma-context > cookies-policy-disclaimer > div > soma-card > div > div:nth-child(2) > soma-button").shadowRoot.querySelector("button")')
cookie_btn.click()
#There can be multiple ways to scroll; below is one of them
driver.execute_script("window.scrollBy(0,700)")
#There are multiple rows which have an expand button, so the XPath is indexed; to click on all of them, use a loop (see the sketch after the imports below)
expandir_fundo = WebDriverWait(driver, 5).until(EC.element_to_be_clickable((By.XPATH, "(//*[@class='eab2eu-0 eUjuAo'])[1]")))
expandir_fundo.click()
Imports:
from time import sleep
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.chrome.options import Options
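As the comment in the code above suggests, a loop can expand every row rather than just the first one. Below is a minimal sketch under the assumption that the arrow icons keep the generated class eab2eu-0 eUjuAo; that is a styled-components hash and may change between site builds:
# Sketch: click every expand arrow; re-locate the elements on each pass
# to avoid stale references once the DOM re-renders.
arrows = WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.XPATH, "//*[@class='eab2eu-0 eUjuAo']")))
for index in range(len(arrows)):
    arrow = driver.find_elements(By.XPATH, "//*[@class='eab2eu-0 eUjuAo']")[index]
    driver.execute_script("arguments[0].scrollIntoView(true);", arrow)
    arrow.click()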
While @YaDav MaNish's answer seems to work, I would rather use a hand-written querySelector to click the accept-cookies button, rather than one generated by the browser.
cookie_btn = driver.execute_script('return document.querySelector("#cookies-policy-container").shadowRoot.querySelector("soma-context soma-card soma-button").shadowRoot.querySelector("button")')
cookie_btn.click()
should work.
Your replies and the cookies code gave me the idea of using execute_script to solve my problem.
Find my code below.
In this first part I only open the URL, remove the cookies, and expand the whole list.
# Setting options
options = Options()
options.add_argument('--window-size=1900,1000')
options.add_argument("--disable-notifications")
# Opening Website
url = 'https://www.xpi.com.br/investimentos/fundos-de-investimento/lista/#/'
driver = webdriver.Chrome(r"chromedriver", options=options)
driver.get(url)
# Removing Cookies
cookie_btn = driver.execute_script('return document.querySelector("#cookies-policy-container").'\
'shadowRoot.querySelector("soma-context > cookies-policy-disclaimer > div > soma-card > div > '\
'div:nth-child(2) > soma-button").shadowRoot.querySelector("button")')
cookie_btn.click()
# Expanding the whole list
expandir = driver.find_elements_by_class_name("sly-row")[-4]
expandir.click()
sleep(4)
# Creating content
page_content = driver.page_source
site = BeautifulSoup(page_content, 'html.parser')
In this second part I used driver.execute_script to open the arrows as I collected the information.
fundos = site.find_all('div', class_='funds-table-row')
for cont, fundo in enumerate(fundos):
    nome = fundo.find('div', class_='fund-name')['title']
    driver.execute_script(f"return document.getElementsByClassName('arrow-details')[{cont + 1}].click()")
    page_detalhe = driver.page_source
    site2 = BeautifulSoup(page_detalhe, 'html.parser')
    details = site2.find('section', class_='has-documents').find_all('div')
    tax_id = details[2].get_text()
Thanks to everyone for the help.
I want to scrape codes only from the below table using Python.
As you can see in the image, I just want to scrape CPT, CTC, PTC, STC, SPT, HTC, P5TC, P1A, P2A, P3A, P1E, P2E, P3E. These codes may change from time to time, for example with the addition of P4E or the removal of P1E.
HTML code for above table is:
<table class="list">
<tbody>
<tr>
<td>
<p>PRODUCT<br>DESCRIPTION</p>
</td>
<td>
<p><strong>Time Charter:</strong> CPT, CTC, PTC, STC, SPT, HTC, P5TC<br><strong>Time Charter Trip:</strong> P1A, P2A, P3A,<br>P1E, P2E, P3E</p>
</td>
<td><strong>Voyage: </strong>C3E, C4E, C5E, C7E</td>
</tr>
<tr>
<td>
<p>CONTRACT SIZE</p>
<p></p>
</td>
<td>
<p>1 day</p>
</td>
<td>
<p>1,000 metric tons</p>
</td>
</tr>
<tr>
<td>
<p>MINIMUM TICK</p>
<p></p>
</td>
<td>
<p>US$ 25</p>
</td>
<td>
<p>US$ 0.01</p>
</td>
</tr>
<tr>
<td>
<p>FINAL SETTLEMENT PRICE</p>
<p></p>
</td>
<td colspan="2" rowspan="1">
<p>The floating price will be the end-of-day price as supplied by the Baltic Exchange.</p>
<p><br><strong>All products:</strong> Final settlement price will be the mean of the daily Baltic Exchange spot price assessments for every trading day in the expiry month.</p>
<p><br><strong>Exception for P1A, P2A, P3A:</strong> Final settlement price will be the mean of the last 7 Baltic Exchange spot price assessments in the expiry month.</p>
</td>
</tr>
<tr>
<td>
<p>CONTRACT SERIES</p>
</td>
<td colspan="2" rowspan="1">
<p><strong><strong>CTC, CPT, PTC, STC, SPT, HTC, P5TC</strong>:</strong> Months, quarters and calendar years out to a maximum of 72 months</p>
<p><strong>C3E, C4E, C5E, C7E, P1A, P2A, P3A, P1E, P2E, P3E:</strong> Months, quarters and calendar years out to a maximum of 36 months</p>
</td>
</tr>
<tr>
<td>
<p>SETTLEMENT</p>
</td>
<td colspan="2" rowspan="1">
<p>At 13:00 hours (UK time) on the last business day of each month within the contract series</p>
</td>
</tr>
</tbody>
</table>
You can see the code at the below link to the website:
https://www.eex.com/en/products/global-commodities/freight
If your use case is to scrape all the text:
You have to induce WebDriverWait for the desired visibility_of_element_located() and you can use either of the following Locator Strategies:
Using CSS_SELECTOR:
driver.get('https://www.eex.com/en/products/global-commodities/freight')
print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "article div:last-child table>tbody>tr td:nth-child(2)>p"))).text)
Using XPATH:
driver.get('https://www.eex.com/en/products/global-commodities/freight')
print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//h3[text()='Contract Specifications']//following::table[1]/tbody/tr//following::td[1]/p"))).text)
Console Output:
Time Charter: CPT, CTC, PTC, STC, SPT, HTC, P5TC
Time Charter Trip: P1A, P2A, P3A,
P1E, P2E, P3E
Note: You have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
Update 1
If you want to extract CPT, CTC, PTC, STC, SPT, HTC, P5TC and P1A, P2A, P3A and P1E, P2E, P3E individually, you can use the following solutions:
Printing CPT, CTC, PTC, STC, SPT, HTC, P5TC
#element = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "article div:last-child table>tbody>tr td:nth-child(2)>p")))
element = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.XPATH, "//h3[text()='Contract Specifications']//following::table[1]/tbody/tr//following::td[1]/p")))
print(driver.execute_script('return arguments[0].childNodes[1].textContent;', element).strip())
Printing P1A, P2A, P3A
#element = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "article div:last-child table>tbody>tr td:nth-child(2)>p")))
element = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.XPATH, "//h3[text()='Contract Specifications']//following::table[1]/tbody/tr//following::td[1]/p")))
print(driver.execute_script('return arguments[0].childNodes[4].textContent;', element).strip())
Printing P1E, P2E, P3E
#element = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "article div:last-child table>tbody>tr td:nth-child(2)>p")))
element = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.XPATH, "//h3[text()='Contract Specifications']//following::table[1]/tbody/tr//following::td[1]/p")))
print(driver.execute_script('return arguments[0].lastChild.textContent;', element).strip())
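For reference, these childNodes indices follow from the <p> markup shown in the question: childNodes[0] is the first <strong>, childNodes[1] is the text node after it (the Time Charter codes), childNodes[4] is the text node after the second <strong> (the P1A, P2A, P3A line), and lastChild is the text node after the final <br> (the P1E, P2E, P3E line).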
Update 2
To print all the items together:
Code Block:
driver.get('https://www.eex.com/en/products/global-commodities/freight')
element = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//h3[text()='Contract Specifications']//following::table[1]/tbody/tr//following::td[1]/p")))
first = driver.execute_script('return arguments[0].childNodes[1].textContent;', element).strip()
second = driver.execute_script('return arguments[0].childNodes[4].textContent;', element).strip()
third = driver.execute_script('return arguments[0].lastChild.textContent;', element).strip()
for item in (first, second, third):
    print(item)
Console Output:
CPT, CTC, PTC, STC, SPT, HTC, P5TC
P1A, P2A, P3A,
P1E, P2E, P3E
If variable txt contains HTML from your question, then this script extracts all required codes:
import re
from bs4 import BeautifulSoup
soup = BeautifulSoup(txt, 'html.parser')
text = soup.select_one('td:contains("Time Charter:")').get_text(' ')
codes = re.findall(r'\b[A-Z\d]{3,4}\b', text)
print(codes)
Prints:
['CPT', 'CTC', 'PTC', 'STC', 'SPT', 'HTC', 'P5TC', 'P1A', 'P2A', 'P3A', 'P1E', 'P2E', 'P3E']
EDIT: To get codes from all tables, you can use this script:
import re
from bs4 import BeautifulSoup
soup = BeautifulSoup(txt, 'html.parser')
all_codes = []
for td in soup.select('td:contains("Time Charter:")'):
    all_codes.extend(re.findall(r'\b[A-Z\d]{3,4}\b', td.get_text(' ')))
print(all_codes)
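If you are working against the live page rather than saved HTML, txt can simply be the source rendered by Selenium (a sketch, assuming driver has already loaded the freight page as in the earlier answers):
txt = driver.page_source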
How to traverse all tr to give values in td? In my code it is overriding the same tr/td.
My table:
#qty to add
<tbody id="gridview-1161-body">
<tr id="gridview-1161-record-19842148" data-boundview="gridview-1161" class="x-grid-row x-grid-data-row" tabindex="-1">
<td role="gridcell" class="x-grid-cell x-grid-td x-grid-cell-headerId-gridcolumn-1158 rp-grid-editable-cell rp-grid-editable-cell" id="ext-gen2535">
<div class="x-grid-cell-inner " style="text-align:right;">
<div class="rp-invalid-cell rp-icon-alert-require-field">
</div>
<input id="numberfield-1243-inputEl" type="text" role="spinbutton" name="Quantity" class="x-form-field x-form-text x-form-focus x-field-form-focus x-field-default-form-focus" autocomplete="off" style="width: 100%;">
</div></td>
</tr>
and more rows like <tr>..</tr> down to the closing </tbody>.
Here all the ids are generated dynamically by the code.
My Python code:
#add qty
rowCount = len(driver.find_elements_by_xpath("//tbody[@id='gridview-1161-body']/tr"))
print(rowCount)
for row in range(rowCount):
    element = WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "td.x-grid-cell.x-grid-td.rp-grid-editable-cell[role='gridcell']")))
    element.click()
    time.sleep(2)
    #input box to give qty - working for this id
    driver.find_element(By.ID, "numberfield-1243-inputEl").send_keys('10')
    driver.find_element(By.ID, "numberfield-1243-inputEl").send_keys(Keys.ENTER)
Due to the dynamic ids I can't use find_element(By.ID), so I am using a CSS_SELECTOR to find the td, but it keeps hitting the same td. How do I move on to the next tr so that I traverse all tr in the table?
To handle the dynamic ids, induce WebDriverWait() with visibility_of_all_elements_located() and the following XPath option:
driver = webdriver.Chrome()
rows = WebDriverWait(driver, 15).until(EC.visibility_of_all_elements_located((By.XPATH, "//tbody[contains(@id,'-body')]//tr[@class='x-grid-row x-grid-data-row']")))
for rownum in range(len(rows)):
    #To avoid a stale element exception, re-locate the row elements on each pass
    rows = WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.XPATH, "//tbody[contains(@id,'-body')]//tr[@class='x-grid-row x-grid-data-row']")))
    element = rows[rownum].find_element_by_xpath(".//td[contains(@class,'rp-grid-editable-cell') and @role='gridcell']")
    element.click()
    qty_input = rows[rownum].find_element_by_xpath(".//input[@name='Quantity' and @role='spinbutton']")
    qty_input.send_keys('10')
    qty_input.send_keys(Keys.ENTER)
Get all rows, then find child td and input:
wait = WebDriverWait(driver, 20)
rows = wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "tr[id*='-record-'].x-grid-data-row")))
for row in rows:
    row.find_element_by_css_selector("td[role='gridcell']").click()
    row.find_element_by_name("Quantity").send_keys("10", Keys.ENTER)
Second way, with XPath and an index:
wait = WebDriverWait(driver, 10)
row_locator = "(//tr[contains(@id,'-record-')])"
rows_len = len(wait.until(EC.presence_of_all_elements_located((By.XPATH, row_locator))))
for i in range(1, rows_len + 1):
    wait.until(EC.element_to_be_clickable((By.XPATH, f"{row_locator}[{i}]/td[@role='gridcell']"))).click()
    driver.find_element_by_xpath(f"{row_locator}[{i}]//input[@name='Quantity']").send_keys("10", Keys.ENTER)
To traverse all the child <input> tags within their ancestor <tr> tags and send a character sequence, you need to induce WebDriverWait for visibility_of_all_elements_located() and you can use the following Locator Strategy:
for element in WebDriverWait(driver, 30).until(EC.visibility_of_all_elements_located((By.XPATH, "//tbody[starts-with(@id, 'gridview') and contains(@id, '-body')]/tr/td//input[@name='Quantity' and starts-with(@id, 'numberfield-')]"))):
    element.send_keys('10')
    element.send_keys(Keys.ENTER)
I've written a script in Python, combining Selenium with BeautifulSoup, to get the links leading to property details from a webpage. As the content is heavily dynamic, I made use of Selenium to get the page source. When I run my script, I get lots of links, including the required links.
How can I get only the relevant link from each of the three containers?
My try:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
def fetch_info(link):
    driver.get(link)
    wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "#community-search-homes .propertyWrapper > a")))
    soup = BeautifulSoup(driver.page_source, "lxml")
    linklist = [item.get("href") for item in soup.select("#community-search-homes .propertyWrapper > a")]
    return linklist
if __name__ == '__main__':
    url = "https://www.khov.com/find-new-homes/arizona/buckeye"
    driver = webdriver.Chrome()
    wait = WebDriverWait(driver, 10)
    for newlink in fetch_info(url):
        print(newlink)
    driver.quit()
Results I'm having:
/find-new-homes/arizona/buckeye/85396/k-hovnanian-homes/aspire-at-sienna-hills
/find-new-homes/arizona/buckeye/85396/k-hovnanian-homes/affinity-at-verrado
/find-new-homes/arizona/buckeye/85396/four-seasons/k.-hovnanian's-four-seasons-at-victory-at-verrado
/find-new-homes/arizona/scottsdale/85255/k-hovnanian-homes/summit-at-silverstone
/find-new-homes/arizona/scottsdale/85257/k-hovnanian-homes/skye
/find-new-homes/arizona/phoenix/85020/k-hovnanian-homes/pointe-16
/find-new-homes/arizona/peoria/85383/k-hovnanian-homes/fusion-ii-at-the-meadows
/find-new-homes/arizona/scottsdale/85257/k-hovnanian-homes/aire
/find-new-homes/arizona/scottsdale/85255/k-hovnanian-homes/pinnacle-at-silverstone
/find-new-homes/arizona/peoria/85383/k-hovnanian-homes/montage-at-the-meadows
/find-new-homes/arizona/sun-city/85373/four-seasons/k.-hovnanian-s-four-seasons-at-ventana-lakes
/find-new-homes/arizona/peoria/85382/k-hovnanian-homes/park-paseo
/find-new-homes/arizona/laveen/85339/k-hovnanian-homes/affinity-at-montana-vista
/find-new-homes/arizona/laveen/85339/k-hovnanian-homes/aspire-at-montana-vista
/find-new-homes/arizona/scottsdale/85255/k-hovnanian-homes/pinnacle-ii-at-silverstone
/find-new-homes/arizona/scottsdale/85255/k-hovnanian-homes/summit-ii-at-silverstone
Results I would like to get:
/find-new-homes/arizona/buckeye/85396/k-hovnanian-homes/aspire-at-sienna-hills
/find-new-homes/arizona/buckeye/85396/k-hovnanian-homes/affinity-at-verrado
/find-new-homes/arizona/buckeye/85396/four-seasons/k.-hovnanian's-four-seasons-at-victory-at-verrado
A chunk of the HTML elements (the link I'm after is within the second line of the following elements):
<div class="propertyWrapper clear">
<span class="link-outside"></span>
<div class="propertyCarouselWrapper">
<div class="responsiveImageCarousel enabled" style="touch-action: pan-y; user-select: none; -webkit-user-drag: none; -webkit-tap-highlight-color: rgba(0, 0, 0, 0);">
<div class="prevBtn"></div>
<div class="nextBtn"></div>
<div class="images" data-detail-url="/find-new-homes/arizona/buckeye/85396/k-hovnanian-homes/aspire-at-sienna-hills">
<ul style="width: 960px; left: 0px;">
<li style="width: 320px;"><img alt="holiday exterior new homes sienna hills usp" src="https://khovcachecdn.azureedge.net/azure/sitefinitylibraries/images/default-source/images/az/aspire-at-sienna-hills/community-thumbnails/holiday-exterior-new-homes-sienna-hills-usp.jpg?sfvrsn=4&build=1019&encoder=wic&useresizingpipeline=true&w=450&h=280&mode=crop"></li>
<li style="width: 320px;"><img alt="carnival exterior new homes sienna hills usp" src="https://khovcachecdn.azureedge.net/azure/sitefinitylibraries/images/default-source/images/az/aspire-at-sienna-hills/community-thumbnails/carnival-exterior-new-homes-sienna-hills-usp.jpg?sfvrsn=4&build=1019&encoder=wic&useresizingpipeline=true&w=450&h=280&mode=crop"></li>
</ul>
</div>
<div class="pagination" style="width: 56px;"><ul><li class="active"></li><li></li></ul></div>
</div>
</div>
<div class="propertyInfoWrapper">
<div class="marker-details-container">
<h3 class="marker-details">New Homes in Buckeye, Arizona</h3>
<div class="spacer"></div>
<h4 class="propertyListingHeader">Aspire at Sienna Hills</h4>
<p class="marker-details">21007 West Almeria Road, Buckeye, AZ 85396</p>
<p class="marker-details marker-status">Final Opportunities</p>
<div class="spacer"></div>
<p class="marker-details marker-price"><span class="bold">Priced from: </span>Mid $200s</p>
<p class="marker-details"><span class="bold">Home type: </span>Single Family Homes</p>
<p class="marker-details marker-amenities"><span class="bold">Amenities: </span>Pool, Hiking Trails, Park</p>
</div>
<div class="community-tag-container">
<a href="/find-new-homes/arizona/buckeye/85396/k-hovnanian-homes/aspire-at-sienna-hills#quick-move-in-homes" onclick="KHOV.Analytics.trackEvent('Qmi_Icon_Qmi');">
<div class="community-tag">
<div class="ctaDesc quick-move-in-badge link-inside">Quick Move In Homes</div>
<div class="ctaIcon quick-move-in-badge-icon link-inside"></div>
</div>
</a>
</div>
<a href="#request-info-form-modal" class="open-inline-modal-link" onclick="KHOV.Analytics.trackEvent('Orange_Ri_Request_Info');">
<div class="button orange-color requestInfoButton link-inside" data-urlname="aspire-at-sienna-hills">Request Info</div>
</a>
</div>
</div>
You need to include the featured results container id as well as the results container id. You can use a CSS Or (comma) to combine them. The latest bs4 supports :not().
#propertyResultsContainer .propertyWrapper :not([onclick])[href*=find], #propertyFeaturedResultsContainer .propertyWrapper :not([onclick])[href*=find]
This can also be shortened to
#propertyResultsContainer .propertyWrapper :not([onclick])[href*=find], #propertyFeaturedResultsContainer
But that shortening may be less robust.
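A minimal usage sketch of that combined selector with BeautifulSoup, assuming the page source has already been rendered by Selenium and that both container ids exist on the page:
soup = BeautifulSoup(driver.page_source, "lxml")
selector = ("#propertyResultsContainer .propertyWrapper :not([onclick])[href*=find], "
            "#propertyFeaturedResultsContainer .propertyWrapper :not([onclick])[href*=find]")
links = [a.get("href") for a in soup.select(selector)]
print(links)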
You can just check for the desired keyword in the link and print those, and ignore the others:
if __name__ == '__main__':
    url = "https://www.khov.com/find-new-homes/arizona/buckeye"
    driver = webdriver.Chrome()
    wait = WebDriverWait(driver, 10)
    for newlink in fetch_info(url):
        if url.split('/')[-1] in newlink:
            print(newlink)
    driver.quit()
Output:
/find-new-homes/arizona/buckeye/85396/k-hovnanian-homes/aspire-at-sienna-hills
/find-new-homes/arizona/buckeye/85396/k-hovnanian-homes/affinity-at-verrado
/find-new-homes/arizona/buckeye/85396/four-seasons/k.-hovnanian's-four-seasons-at-victory-at-verrado
Would list slicing work?
def fetch_info(link):
    driver.get(link)
    wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "#community-search-homes .propertyWrapper > a")))
    soup = BeautifulSoup(driver.page_source, "lxml")
    linklist = [item.get("href") for item in soup.select("#community-search-homes .propertyWrapper > a")][:3]
    return linklist
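Note that this relies on the three Buckeye results being listed first inside the container; if the ordering ever changes, filtering by the keyword as in the previous answer is the safer route.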
I am trying to scrape some information from a website using Selenium. Below is the link to the website: http://www.ultimatetennisstatistics.com/playerProfile?playerId=4742
The information I am trying to get is under the player's 'Statistics'. My code right now opens the player's profile and then opens the player's statistics page. I am trying to find a way to extract the information on the player's statistics page. Below is my code so far:
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("http://www.ultimatetennisstatistics.com/playerProfile?playerId=4742")
soup = BeautifulSoup(driver.page_source,"lxml")
try:
    dropdown = driver.find_element_by_xpath('//*[@id="playerPills"]/li[9]/a')
    dropdown.click()
    bm = driver.find_element_by_id('statisticsPill')
    bm.click()
    for i in soup.select('#statistics table.table tr'):
        print(i)
        data1 = [x.get_text(strip=True) for x in i.select("th,td")]
        print(data1)
except ValueError:
    print("error")
I get output like this (truncated HTML from the 'Serve' section of the table):
<th class="pct-data text-right"><i class="fa fa-percent"></i></th>
<th class="raw-data text-right" style="display: none;"><i class="fa fa-hashtag"></i></th>
</tr>
</thead>
<tbody>
<tr>
<td>Ace %</td>
<th class="text-right pct-data">23.4%</th>
<th class="raw-data text-right" style="display: none;">12942 / 55377</th>
</tr>
<tr>
<td>Double Fault %</td>
<th class="text-right pct-data">4.2%</th>
<th class="raw-data text-right" style="display:
To extract the player's information from the Statistics page you can use the following solution:
Code Block:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_argument('disable-infobars')
driver=webdriver.Chrome(chrome_options=options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
driver.get("http://www.ultimatetennisstatistics.com/playerProfile?playerId=4742")
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//ul[#id='playerPills']//a[#class='dropdown-toggle'][normalize-space()='Statistics']"))).click()
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//ul[#class='dropdown-menu']//a[#id='statisticsPill'][normalize-space()='Statistics']"))).click()
statistics_items = WebDriverWait(driver, 10).until(EC.visibility_of_any_elements_located((By.XPATH, "//table[#class='table table-condensed table-hover table-striped']//tbody//tr/td")))
statistics_value = WebDriverWait(driver, 10).until(EC.visibility_of_any_elements_located((By.XPATH, "//table[#class='table table-condensed table-hover table-striped']//tbody//tr//following::th[1]")))
for item, value in zip(statistics_items, statistics_value):
    print('{} {}'.format(item.text, value.text))
Console Output:
Ace % 4.0%
Double Fault % 2.1%
1st Serve % 68.7%
1st Serve Won % 71.8%
2nd Serve Won % 57.3%
Break Points Saved % 66.3%
Service Points Won % 67.2%
Service Games Won % 85.6%
Ace Against % Return
Double Fault Against % 7.2%
1st Srv. Return Won % 3.4%
2nd Srv. Return Won % 34.2%
Break Points Won % 55.3%
Return Points Won % 44.9%
Return Games Won % 42.4%
Points Dominance 33.3%
Games Dominance Total
Break Points Ratio 1.29
Total Points Won % 2.31
Games Won % 1.33
Sets Won % 54.4%
Matches Won % 59.7%
Match Time 77.2%
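Note that some pairs in this output have drifted (for example 'Ace Against % Return' and 'Games Dominance Total'): the section headings inside the table (Serve, Return, Total) end up in one of the two zipped lists without a matching partner. A hedged alternative is to pair the label and value within each row instead, assuming the heading rows do not contain a pct-data cell:
rows = WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.XPATH, "//table[@class='table table-condensed table-hover table-striped']//tbody/tr")))
for row in rows:
    cells = row.find_elements_by_xpath("./td | ./th[contains(@class,'pct-data')]")
    if len(cells) == 2:
        print('{} {}'.format(cells[0].text, cells[1].text))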
The problem is with the location of this line:
soup = BeautifulSoup(driver.page_source,"lxml")
It should come AFTER you have clicked on the 'Statistics' tab, because only then does the table load and soup can parse it.
Final code:
from selenium import webdriver
from bs4 import BeautifulSoup
import time
driver = webdriver.Chrome(executable_path=r'//path/chromedriver.exe')
driver.get("http://www.ultimatetennisstatistics.com/playerProfile?playerId=4742")
try:
    dropdown = driver.find_element_by_xpath('//*[@id="playerPills"]/li[9]/a')
    dropdown.click()
    bm = driver.find_element_by_id('statisticsPill')
    bm.click()
    driver.maximize_window()
    soup = BeautifulSoup(driver.page_source, "lxml")
    for i in soup.select('#statisticsOverview table tr'):
        print(i.text)
        data1 = [x.get_text(strip=True) for x in i.select("th,td")]
        print(data1)
except ValueError:
    print("error")