How to scrape specific words from table row? - python

I want to scrape only the codes from the table below using Python.
As shown in the image, I just want to scrape CPT, CTC, PTC, STC, SPT, HTC, P5TC, P1A, P2A, P3A, P1E, P2E, P3E. These codes may change from time to time, e.g. P4E might be added or P1E removed.
The HTML code for the above table is:
<table class="list">
<tbody>
<tr>
<td>
<p>PRODUCT<br>DESCRIPTION</p>
</td>
<td>
<p><strong>Time Charter:</strong> CPT, CTC, PTC, STC, SPT, HTC, P5TC<br><strong>Time Charter Trip:</strong> P1A, P2A, P3A,<br>P1E, P2E, P3E</p>
</td>
<td><strong>Voyage: </strong>C3E, C4E, C5E, C7E</td>
</tr>
<tr>
<td>
<p>CONTRACT SIZE</p>
<p></p>
</td>
<td>
<p>1 day</p>
</td>
<td>
<p>1,000 metric tons</p>
</td>
</tr>
<tr>
<td>
<p>MINIMUM TICK</p>
<p></p>
</td>
<td>
<p>US$ 25</p>
</td>
<td>
<p>US$ 0.01</p>
</td>
</tr>
<tr>
<td>
<p>FINAL SETTLEMENT PRICE</p>
<p></p>
</td>
<td colspan="2" rowspan="1">
<p>The floating price will be the end-of-day price as supplied by the Baltic Exchange.</p>
<p><br><strong>All products:</strong> Final settlement price will be the mean of the daily Baltic Exchange spot price assessments for every trading day in the expiry month.</p>
<p><br><strong>Exception for P1A, P2A, P3A:</strong> Final settlement price will be the mean of the last 7 Baltic Exchange spot price assessments in the expiry month.</p>
</td>
</tr>
<tr>
<td>
<p>CONTRACT SERIES</p>
</td>
<td colspan="2" rowspan="1">
<p><strong><strong>CTC, CPT, PTC, STC, SPT, HTC, P5TC</strong>:</strong> Months, quarters and calendar years out to a maximum of 72 months</p>
<p><strong>C3E, C4E, C5E, C7E, P1A, P2A, P3A, P1E, P2E, P3E:</strong> Months, quarters and calendar years out to a maximum of 36 months</p>
</td>
</tr>
<tr>
<td>
<p>SETTLEMENT</p>
</td>
<td colspan="2" rowspan="1">
<p>At 13:00 hours (UK time) on the last business day of each month within the contract series</p>
</td>
</tr>
</tbody>
</table>
You can see the code at the following link:
https://www.eex.com/en/products/global-commodities/freight

If your use case is to scrape all the text:
You have to induce WebDriverWait for the desired visibility_of_element_located() and you can use either of the following locator strategies:
Using CSS_SELECTOR:
driver.get('https://www.eex.com/en/products/global-commodities/freight')
print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "article div:last-child table>tbody>tr td:nth-child(2)>p"))).text)
Using XPATH:
driver.get('https://www.eex.com/en/products/global-commodities/freight')
print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//h3[text()='Contract Specifications']//following::table[1]/tbody/tr//following::td[1]/p"))).text)
Console Output:
Time Charter: CPT, CTC, PTC, STC, SPT, HTC, P5TC
Time Charter Trip: P1A, P2A, P3A,
P1E, P2E, P3E
Note : You have to add the following imports :
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
Update 1
If you want to extract CPT, CTC, PTC, STC, SPT, HTC, P5TC and P1A, P2A, P3A and P1E, P2E, P3E individually, you can use the following solutions:
Printing CPT, CTC, PTC, STC, SPT, HTC, P5TC
#element = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "article div:last-child table>tbody>tr td:nth-child(2)>p")))
element = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.XPATH, "//h3[text()='Contract Specifications']//following::table[1]/tbody/tr//following::td[1]/p")))
print(driver.execute_script('return arguments[0].childNodes[1].textContent;', element).strip())
Printing P1A, P2A, P3A
#element = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "article div:last-child table>tbody>tr td:nth-child(2)>p")))
element = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.XPATH, "//h3[text()='Contract Specifications']//following::table[1]/tbody/tr//following::td[1]/p")))
print(driver.execute_script('return arguments[0].childNodes[4].textContent;', element).strip())
Printing P1E, P2E, P3E
#element = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "article div:last-child table>tbody>tr td:nth-child(2)>p")))
element = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.XPATH, "//h3[text()='Contract Specifications']//following::table[1]/tbody/tr//following::td[1]/p")))
print(driver.execute_script('return arguments[0].lastChild.textContent;', element).strip())
Update 2
To print all the items together:
Code Block:
driver.get('https://www.eex.com/en/products/global-commodities/freight')
element = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//h3[text()='Contract Specifications']//following::table[1]/tbody/tr//following::td[1]/p")))
first = driver.execute_script('return arguments[0].childNodes[1].textContent;', element).strip()
second = driver.execute_script('return arguments[0].childNodes[4].textContent;', element).strip()
third = driver.execute_script('return arguments[0].lastChild.textContent;', element).strip()
for text in (first, second, third):
    print(text)
Console Output:
CPT, CTC, PTC, STC, SPT, HTC, P5TC
P1A, P2A, P3A,
P1E, P2E, P3E

If the variable txt contains the HTML from your question, then this script extracts all the required codes:
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(txt, 'html.parser')
text = soup.select_one('td:contains("Time Charter:")').text
codes = re.findall(r'[A-Z\d]{3,4}', text)  # {3,4} so four-character codes like P5TC aren't truncated
print(codes)
Prints:
['CPT', 'CTC', 'PTC', 'STC', 'SPT', 'HTC', 'P5TC', 'P1A', 'P2A', 'P3A', 'P1E', 'P2E', 'P3E']
EDIT: To get codes from all tables, you can use this script:
import re
from bs4 import BeautifulSoup
soup = BeautifulSoup(txt, 'html.parser')
all_codes = []
for td in soup.select('td:contains("Time Charter:")'):
    all_codes.extend(re.findall(r'[A-Z\d]{3,4}', td.text))
print(all_codes)
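The same extraction can also be done with the standard library alone, with no BeautifulSoup dependency. A minimal sketch, assuming the codes are always 3-4 uppercase letters/digits and the labels contain no such runs (the HTML string is a trimmed copy of the cell from the question):

```python
import re

# Trimmed copy of the relevant <td> from the question's HTML.
html = ('<td><p><strong>Time Charter:</strong> CPT, CTC, PTC, STC, SPT, HTC, P5TC'
        '<br><strong>Time Charter Trip:</strong> P1A, P2A, P3A,<br>P1E, P2E, P3E</p></td>')

# Drop the tags, then pull out runs of 3-4 uppercase letters/digits.
text = re.sub(r'<[^>]+>', ' ', html)
codes = re.findall(r'\b[A-Z][A-Z\d]{2,3}\b', text)
print(codes)
```

The word boundaries keep label words like "Time" and "Charter" from matching, since the pattern requires at least three consecutive uppercase letters or digits.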

Related

I have a webpage where, based on the start and end date, the number of textboxes increases or decreases. I am not getting the proper row count and textbox count.

This is the code I am using to count the number of rows; it returns only 1:
tablen = driver.find_elements(By.XPATH, '//*[@id="m_mc_s0_z0_C_ctl00_tblForecast"]/tbody/tr')
tblength = len(tablen)
print(tblength)
Result: it prints 1, but there are many rows:
<table id="m_mc_s0_z0_C_ctl00_tblForecast">
<tbody><tr>
<td class="separatorLabel" colspan="15">Quarter 4 2021</td>
</tr><tr>
<td align="right" style="width:180px;">2021 November</td><td class="nonrequiredblock"></td><td class="ctl" style="width:160px;"><span id="m_mc_s0_z0_C_ctl00_c202111_wrapper" class="riSingle RadInput RadInput_Silk" style="width:80px;"><input id="m_mc_s0_z0_C_ctl00_c202111" name="m$mc$s0$z0$C$ctl00$c202111" class="riTextBox riEnabled" value="0.00" type="text"><input id="m_mc_s0_z0_C_ctl00_c202111_ClientState" name="m_mc_s0_z0_C_ctl00_c202111_ClientState" type="hidden" autocomplete="off" value="{"enabled":true,"emptyMessage":"","validationText":"0","valueAsString":"0","minValue":0,"maxValue":70368744177664,"lastSetTextBoxValue":"0.00"}"></span></td><td align="right" style="width:180px;">2021 December</td><td class="nonrequiredblock"></td><td class="ctl" style="width:160px;"><span id="m_mc_s0_z0_C_ctl00_c202112_wrapper" class="riSingle RadInput RadInput_Silk" style="width:80px;"><input id="m_mc_s0_z0_C_ctl00_c202112" name="m$mc$s0$z0$C$ctl00$c202112" class="riTextBox riEnabled" value="0.00" type="text"><input id="m_mc_s0_z0_C_ctl00_c202112_ClientState" name="m_mc_s0_z0_C_ctl00_c202112_ClientState" type="hidden" autocomplete="off" value="{"enabled":true,"emptyMessage":"","validationText":"0","valueAsString":"0","minValue":0,"maxValue":70368744177664,"lastSetTextBoxValue":"0.00"}"></span></td>
</tr><tr>
<td class="separatorLabel" colspan="15">Quarter 1 2022</td>
</tr><tr>
<td align="right" style="width:180px;">2022 January</td><td class="nonrequiredblock"></td><td class="ctl" style="width:160px;"><span id="m_mc_s0_z0_C_ctl00_c202201_wrapper" class="riSingle RadInput RadInput_Silk" style="width:80px;"><input id="m_mc_s0_z0_C_ctl00_c202201" name="m$mc$s0$z0$C$ctl00$c202201" class="riTextBox riEnabled" value="0.00" type="text"><input id="m_mc_s0_z0_C_ctl00_c202201_ClientState" name="m_mc_s0_z0_C_ctl00_c202201_ClientState" type="hidden" autocomplete="off" value="{"enabled":true,"emptyMessage":"","validationText":"0","valueAsString":"0","minValue":0,"maxValue":70368744177664,"lastSetTextBoxValue":"0.00"}"></span></td><td align="right" style="width:180px;">2022 February</td><td class="nonrequiredblock"></td><td class="ctl" style="width:160px;"><span id="m_mc_s0_z0_C_ctl00_c202202_wrapper" class="riSingle RadInput RadInput_Silk" style="width:80px;"><input id="m_mc_s0_z0_C_ctl00_c202202" name="m$mc$s0$z0$C$ctl00$c202202" class="riTextBox riEnabled" value="0.00" type="text"><input id="m_mc_s0_z0_C_ctl00_c202202_ClientState" name="m_mc_s0_z0_C_ctl00_c202202_ClientState" type="hidden" autocomplete="off" value="{"enabled":true,"emptyMessage":"","validationText":"0","valueAsString":"0","minValue":0,"maxValue":70368744177664,"lastSetTextBoxValue":"0.00"}"></span></td>
</tr>
</tbody></table>
Writing this in the answer section as I have a snapshot to paste.
Is your locator correct? I built HTML from your code and inspected the DOM; although your locator finds 4 elements, they don't seem to be the right ones.
See the DOM snapshot: the locator highlights a row with which you cannot do much that is useful. I have written a code block. Please check if it is useful to you.
driver.get("url here")
time.sleep(3)
x = driver.find_elements(By.XPATH, "//*[@id='m_mc_s0_z0_C_ctl00_tblForecast']//tbody//tr")
print(f"Length of the element x is: {len(x)}")
date_txt = driver.find_elements(By.XPATH, "//*[@class='ctl']//../td[@align='right']")
print(f"Length of date_txt is {len(date_txt)}")
ls = [i.text for i in date_txt]
print(ls)
ctl = driver.find_elements(By.XPATH, "//*[@class='ctl']//input[@type='text']")
print(f"Length of ctl element is: {len(ctl)}")
for ele in ctl:
    ele.clear()
    ele.send_keys("1.00")
    time.sleep(1)
    print(ele.get_attribute('value'))
Output:
Length of the element x is: 4
Length of date_txt is 4
['2021 November', '2021 December', '2022 January', '2022 February']
Length of ctl element is: 4
1.00
1.00
1.00
1.00
Process finished with exit code 0
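The row count of 4 can be sanity-checked offline against the posted HTML with the standard library alone; a sketch with no Selenium involved, using a trimmed copy of the question's table:

```python
from html.parser import HTMLParser

class RowCounter(HTMLParser):
    """Counts <tr> tags and collects the right-aligned month labels."""
    def __init__(self):
        super().__init__()
        self.rows = 0
        self.months = []
        self._grab = False

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self.rows += 1
        # The month labels sit in right-aligned <td> cells.
        if tag == 'td' and ('align', 'right') in attrs:
            self._grab = True

    def handle_data(self, data):
        if self._grab and data.strip():
            self.months.append(data.strip())
            self._grab = False

html = """
<table id="tblForecast"><tbody>
<tr><td class="separatorLabel" colspan="15">Quarter 4 2021</td></tr>
<tr><td align="right">2021 November</td><td class="ctl"></td>
<td align="right">2021 December</td><td class="ctl"></td></tr>
<tr><td class="separatorLabel" colspan="15">Quarter 1 2022</td></tr>
<tr><td align="right">2022 January</td><td class="ctl"></td>
<td align="right">2022 February</td><td class="ctl"></td></tr>
</tbody></table>
"""

p = RowCounter()
p.feed(html)
print(p.rows)    # 4, matching the Selenium output
print(p.months)  # ['2021 November', '2021 December', '2022 January', '2022 February']
```

This confirms the table really does contain four rows: two separator rows and two data rows, which is why locating months by the `td[@align='right']` cells rather than by row index is the safer approach.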

how to get text by moving cursor by selenium?

I am trying to get the text from each bar in the following plot.
Here is what I tried:
driver = webdriver.Chrome('d:/chromedriver.exe')
driver.get('https://dph.georgia.gov/covid-19-daily-status-report')
frame = driver.find_element_by_css_selector('#covid19dashdph > iframe')
driver.switch_to.frame(frame)
element = driver.find_element_by_xpath('//*[@id="root"]/div/div[3]/div[4]/div/div[4]/div/div')
print(element.text) # return ''
# action = ActionChains(driver)
# action.move_by_offset(1, 1)
My questions are:
How do I get the text value? (I can see the text in the page source.)
How do I move the mouse cursor from one bar to the next to get each day's case number?
I just clicked on the svg element and printed its text, which is in a tooltip tag on the site.
driver.get('https://dph.georgia.gov/covid-19-daily-status-report')
frame=WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.CSS_SELECTOR, '#covid19dashdph > iframe')))
driver.switch_to.frame(frame)
svg=WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.CSS_SELECTOR, " div.MuiBox-root.jss326 > div > svg")))
svg.click()
element=WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.CSS_SELECTOR, "div.MuiBox-root.jss326 > div > div")))
print(element.text)
Imports:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
Outputs
07Jun20
Confirmed Cases 524
7-day Moving Average 720.7
The html tag consists of:
<div class="c3-tooltip-container" style="position: absolute; pointer-events: none; display: none; top: 529.5px; left: 74.5px;">
<table class="c3-tooltip">
<tbody>
<tr><th colspan="2">07Jun20</th></tr>
<tr class="c3-tooltip-name--Confirmed-Cases">
<td class="name"><span style="background-color:#33a3ff"></span>Confirmed Cases</td>
<td class="value">524</td></tr>
<tr class="c3-tooltip-name--\37 -day-Moving-Average">
<td class="name"><span style="background-color:#ffcc32"></span>7-day Moving Average</td>
<td class="value">720.7</td>
</tr></tbody></table></div>
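The tooltip markup above is a plain table, so once its HTML is in hand the name/value pairs can be pulled out without Selenium. A stdlib-only sketch over a trimmed copy of the snippet (the inner color <span> elements are omitted for brevity):

```python
from html.parser import HTMLParser

class TooltipParser(HTMLParser):
    """Collects the text of each <th>/<td> cell in the c3 tooltip table."""
    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_cell = False
        self._buf = ''

    def handle_starttag(self, tag, attrs):
        if tag in ('th', 'td'):
            self._in_cell = True
            self._buf = ''

    def handle_endtag(self, tag):
        if tag in ('th', 'td'):
            self._in_cell = False
            if self._buf.strip():
                self.cells.append(self._buf.strip())

    def handle_data(self, data):
        if self._in_cell:
            self._buf += data

html = '''
<table class="c3-tooltip"><tbody>
<tr><th colspan="2">07Jun20</th></tr>
<tr><td class="name">Confirmed Cases</td><td class="value">524</td></tr>
<tr><td class="name">7-day Moving Average</td><td class="value">720.7</td></tr>
</tbody></table>
'''

p = TooltipParser()
p.feed(html)
date, *pairs = p.cells
print(date)                                # 07Jun20
print(dict(zip(pairs[::2], pairs[1::2])))  # {'Confirmed Cases': '524', '7-day Moving Average': '720.7'}
```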

Python (Selenium): selenium.common.exceptions.WebDriverException: Message: An unknown error occurred while processing the specified command

I have a python code that downloads data from the table contained in a web page to a local csv file. The code has run into an exception saying an unknown error occurred. Please see below for details.
Error Message:
Traceback (most recent call last):
  File "test.py", line 50, in <module>
    wr.writerow([d.text for d in row.find_elements_by_css_selector('td')])
  File "test.py", line 50, in <listcomp>
    wr.writerow([d.text for d in row.find_elements_by_css_selector('td')])
  File "C:\Users\username\PycharmProjects\Web_Scraping\venv\lib\site-packages\selenium\webdriver\remote\webelement.py", line 76, in text
    return self._execute(Command.GET_ELEMENT_TEXT)['value']
  File "C:\Users\username\PycharmProjects\Web_Scraping\venv\lib\site-packages\selenium\webdriver\remote\webelement.py", line 633, in _execute
    return self._parent.execute(command, params)
  File "C:\Users\username\PycharmProjects\Web_Scraping\venv\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 321, in execute
    self.error_handler.check_response(response)
  File "C:\Users\username\PycharmProjects\Web_Scraping\venv\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: An unknown error occurred while processing the specified command.
Python code:
import selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver import ActionChains
from selenium.common.exceptions import TimeoutException
import time
import csv
from datetime import datetime
# Locate Edge driver
driver = webdriver.Edge(executable_path = "C://Windows//SysWOW64//MicrosoftWebDriver.exe")
driver.maximize_window()
# Using Edge to open the steam website
driver.get("https://partner.steampowered.com")
# Pause the driver for better performance
driver.implicitly_wait(10)
# Enter email address
login_un = driver.find_element_by_id('username').send_keys("")
# Enter password
login_pw = driver.find_element_by_id('password').send_keys("")
# Click sign in to log in
driver.find_element_by_id('login_btn_signin').click()
# Find the desired link
driver.find_element_by_link_text('Age of Empires II: Definitive Edition').click()
time.sleep(3)
# Locate the link for Current Players
driver.find_element_by_css_selector('#gameDataLeft > div:nth-child(1) > table > tbody > tr:nth-child(9) > td:nth-child(3) > a').click()
time.sleep(5)
# Locate 1 year for Current Players
driver.find_element_by_xpath('/html/body/center/div/div[3]/div[1]/em[1]').click()
# x.click()
time.sleep(3)
# Locate the table element
table = driver.find_element_by_css_selector('body > center > div > div:nth-child(13) > table')
# Open local csv and save data
filename = datetime.now().strftime('C:/Users/username/Desktop/Output/Concurrent_Players_%Y%m%d_%H%M.csv')
with open(filename, 'w', newline='') as csvfile:
    wr = csv.writer(csvfile)
    for row in table.find_elements_by_css_selector('tr'):
        wr.writerow([d.text for d in row.find_elements_by_css_selector('td')])
print("Concurrent_Player data is saved. ")
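The CSV-writing half of the script can be exercised on its own before wiring Selenium in; a sketch with made-up rows standing in for the scraped cells (written to an in-memory buffer instead of a file, purely for illustration):

```python
import csv
import io

# Stand-in rows in the shape the scraper would produce.
rows = [
    ['Average daily peak concurrent users', '4,032', '11', '+25971%'],
    ['Maximum daily peak concurrent users', '26,767', '74', '+51375%'],
]

buf = io.StringIO()
wr = csv.writer(buf)
for row in rows:
    wr.writerow(row)

print(buf.getvalue())
```

Note that csv.writer quotes fields containing commas (e.g. "4,032") automatically, so thousands separators in the scraped numbers are safe.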
HTML Source: (Sorry for not being able to provide the URL because this is an internal website. )
<div>
<table>
<tbody><tr>
<td></td>
<td></td>
<td align="right" title="2019-02-11 to 2020-02-09"><b>Most recent year</b></td>
<td></td> <!--Expandable percentage column-->
<td align="right"><b>Daily average during period<b></b></b></td>
<td align="right"><b>Change vs. previous period</b></td>
<td align="right"></td>
<td class="dim" align="right" title="2018-02-12 to 2019-02-10"><b>Previous year</b></td>
<td></td> <!--Expandable percentage column-->
<td class="dim" align="right"><b>Previous daily average<b></b></b></td>
</tr>
<tr>
<td>Average daily peak concurrent users </td>
<td></td>
<td align="right">4,032</td>
<td align="right"></td>
<td align="right">11</td>
<td align="right"><span style="color:#B5DB42;">+25971%</span></td>
<td width="16"></td>
<td class="dim" align="right">15</td>
<td class="dim" align="right"></td>
<td class="dim" align="right" width="100">0</td>
<td></td>
</tr>
<tr>
<td>Maximum daily peak concurrent users </td>
<td></td>
<td align="right">26,767</td>
<td align="right"></td>
<td align="right">74</td>
<td align="right"><span style="color:#B5DB42;">+51375%</span></td>
<td width="16"></td>
<td class="dim" align="right">52</td>
<td class="dim" align="right"></td>
<td class="dim" align="right" width="100">0</td>
<td></td>
</tr>
<tr>
<td>Average daily active users </td>
<td></td>
<td align="right">24,686</td>
<td align="right"></td>
<td align="right">68</td>
<td align="right"><span style="color:#B5DB42;">+70506%</span></td>
<td width="16"></td>
<td class="dim" align="right">35</td>
<td class="dim" align="right"></td>
<td class="dim" align="right" width="100">0</td>
<td></td>
</tr>
<tr>
<td>Maximum daily active users </td>
<td></td>
<td align="right">157,231</td>
<td align="right"></td>
<td align="right">432</td>
<td align="right"><span style="color:#B5DB42;">+191645%</span></td>
<td width="16"></td>
<td class="dim" align="right">82</td>
<td class="dim" align="right"></td>
<td class="dim" align="right" width="100">0</td>
<td></td>
</tr>
</tbody></table>
</div>
Screenshot of web UI:
The code does generate a csv file as specified, but no data is saved due to the error. I have other similar Python scripts implemented the same way that work fine; however, I'm not able to troubleshoot this one by myself. I hope the information provided is enough for you to review. Thanks so much in advance!
Induce WebDriverWait with visibility_of_element_located() and the following XPath to identify the table, then find the rows and then the column values.
table = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//table[contains(.,'Average daily peak concurrent users')]")))
for row in table.find_elements_by_xpath(".//tr"):
    rowdata = [col.text for col in row.find_elements_by_xpath(".//td")]
    print(rowdata)
Based on your example, it prints the following on the console:
['', '', 'Most recent year', '', 'Daily average during period', 'Change vs. previous period', '', 'Previous year', '', 'Previous daily average']
['Average daily peak concurrent users', '', '4,032', '', '11', '+25971%', '', '15', '', '0', '']
['Maximum daily peak concurrent users', '', '26,767', '', '74', '+51375%', '', '52', '', '0', '']
['Average daily active users', '', '24,686', '', '68', '+70506%', '', '35', '', '0', '']
['Maximum daily active users', '', '157,231', '', '432', '+191645%', '', '82', '', '0', '']
Since table.find_elements_by_xpath(".//tr") was not working for this table, I used a crude workaround. It works fine for my case so far.
Updated code (Partial):
filename = datetime.now().strftime('C:/Users/username/Desktop/Output/data_%Y%m%d_%H%M.csv')
with open(filename, 'w', newline='', encoding="utf-8") as csvfile:
    wr = csv.writer(csvfile)
    a = driver.find_element_by_xpath('/html/body/center/div/div[4]/table/tbody/tr[1]/td[1]').text
    b = driver.find_element_by_xpath('/html/body/center/div/div[4]/table/tbody/tr[1]/td[3]').text
    c = driver.find_element_by_xpath('/html/body/center/div/div[4]/table/tbody/tr[1]/td[5]').text
    d = driver.find_element_by_xpath('/html/body/center/div/div[4]/table/tbody/tr[1]/td[6]').text
    e = driver.find_element_by_xpath('/html/body/center/div/div[4]/table/tbody/tr[1]/td[8]').text
    f = driver.find_element_by_xpath('/html/body/center/div/div[4]/table/tbody/tr[1]/td[10]').text
    wr.writerow([a, b, c, d, e, f])
print("Done. ")
driver.quit()
Reasoning:
What I've observed so far is that this table has empty td elements in each tr. (See the spot where the cursor is in the screenshot for an example.) Every cell has an empty/blank td next to it. The driver cannot handle the empty tds and throws an exception. So in my code I had to specify the exact td numbers to scan so it wouldn't time out.
If anyone can come up with a solution that lets the code skip the empty tds, or scan only the tds that contain actual text, that would be the optimal solution.
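One way to skip the empty tds after the fact is to filter the scraped row lists in Python rather than in the locator. A minimal sketch with made-up data in the shape of the earlier console output:

```python
# Rows as they come back from the page: real values interleaved with
# blanks from the empty spacer <td> elements.
rows = [
    ['Average daily peak concurrent users', '', '4,032', '', '11', '+25971%', '', '15', '', '0', ''],
    ['Maximum daily peak concurrent users', '', '26,767', '', '74', '+51375%', '', '52', '', '0', ''],
]

# Keep only the cells with actual text.
cleaned = [[cell for cell in row if cell.strip()] for row in rows]
for row in cleaned:
    print(row)
```

This avoids hard-coding td indices entirely: scrape every cell, then drop the blanks before writing to CSV.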
There might be a timeout or stale element issue. Try getting the elements in the table like this.
#your code
#for row in table.find_elements_by_css_selector('tr'):
#    wr.writerow([d.text for d in row.find_elements_by_css_selector('td')])
table_elements = WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located(
    (By.CSS_SELECTOR, 'tr td')))
for row in table_elements:
    print(row.text)
    #wr.writerow(row.text)

How to extract the player's information from the Statistics page as per the HTML?

I am trying to scrape some information from a website using Selenium. Below is the link to the website: http://www.ultimatetennisstatistics.com/playerProfile?playerId=4742
The information I am trying to get is under the player 'Statistics'. My code currently opens the player's profile and then the player's Statistics page; I am trying to find a way to extract the information on that page. Below is my code so far:
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("http://www.ultimatetennisstatistics.com/playerProfile?playerId=4742")
soup = BeautifulSoup(driver.page_source,"lxml")
try:
    dropdown = driver.find_element_by_xpath('//*[@id="playerPills"]/li[9]/a')
    dropdown.click()
    bm = driver.find_element_by_id('statisticsPill')
    bm.click()
    for i in soup.select('#statistics table.table tr'):
        print(i)
        data1 = [x.get_text(strip=True) for x in i.select("th,td")]
        print(data1)
except ValueError:
    print("error")
Part of the HTML under the 'Serve' section is:
<th class="pct-data text-right"><i class="fa fa-percent"></i></th>
<th class="raw-data text-right" style="display: none;"><i class="fa fa-hashtag"></i></th>
</tr>
</thead>
<tbody>
<tr>
<td>Ace %</td>
<th class="text-right pct-data">23.4%</th>
<th class="raw-data text-right" style="display: none;">12942 / 55377</th>
</tr>
<tr>
<td>Double Fault %</td>
<th class="text-right pct-data">4.2%</th>
<th class="raw-data text-right" style="display:
To extract the players' information from the Statistics page you can use the following solution:
Code Block:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_argument('disable-infobars')
driver=webdriver.Chrome(chrome_options=options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
driver.get("http://www.ultimatetennisstatistics.com/playerProfile?playerId=4742")
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//ul[@id='playerPills']//a[@class='dropdown-toggle'][normalize-space()='Statistics']"))).click()
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//ul[@class='dropdown-menu']//a[@id='statisticsPill'][normalize-space()='Statistics']"))).click()
statistics_items = WebDriverWait(driver, 10).until(EC.visibility_of_any_elements_located((By.XPATH, "//table[@class='table table-condensed table-hover table-striped']//tbody//tr/td")))
statistics_value = WebDriverWait(driver, 10).until(EC.visibility_of_any_elements_located((By.XPATH, "//table[@class='table table-condensed table-hover table-striped']//tbody//tr//following::th[1]")))
for item, value in zip(statistics_items, statistics_value):
    print('{} {}'.format(item.text, value.text))
Console Output:
Ace % 4.0%
Double Fault % 2.1%
1st Serve % 68.7%
1st Serve Won % 71.8%
2nd Serve Won % 57.3%
Break Points Saved % 66.3%
Service Points Won % 67.2%
Service Games Won % 85.6%
Ace Against % Return
Double Fault Against % 7.2%
1st Srv. Return Won % 3.4%
2nd Srv. Return Won % 34.2%
Break Points Won % 55.3%
Return Points Won % 44.9%
Return Games Won % 42.4%
Points Dominance 33.3%
Games Dominance Total
Break Points Ratio 1.29
Total Points Won % 2.31
Games Won % 1.33
Sets Won % 54.4%
Matches Won % 59.7%
Match Time 77.2%
The problem is with the location of this line:
soup = BeautifulSoup(driver.page_source,"lxml")
It should come AFTER you have clicked on the "Statistics" tab, because only then is the table loaded for soup to parse.
Final code -
from selenium import webdriver
from bs4 import BeautifulSoup
import time
driver = webdriver.Chrome(executable_path=r'//path/chromedriver.exe')
driver.get("http://www.ultimatetennisstatistics.com/playerProfile?playerId=4742")
try:
    dropdown = driver.find_element_by_xpath('//*[@id="playerPills"]/li[9]/a')
    dropdown.click()
    bm = driver.find_element_by_id('statisticsPill')
    bm.click()
    driver.maximize_window()
    soup = BeautifulSoup(driver.page_source, "lxml")
    for i in soup.select('#statisticsOverview table tr'):
        print(i.text)
        data1 = [x.get_text(strip=True) for x in i.select("th,td")]
        print(data1)
except ValueError:
    print("error")
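For reference, the name/percentage pairs can also be recovered straight from markup like the question's snippet with a regex, no browser needed. A sketch against a trimmed copy; it assumes each stat row keeps the td-then-th shape shown above:

```python
import re

# Trimmed copy of the stats rows from the question's HTML.
html = '''
<tr><td>Ace %</td><th class="text-right pct-data">23.4%</th>
<th class="raw-data text-right" style="display: none;">12942 / 55377</th></tr>
<tr><td>Double Fault %</td><th class="text-right pct-data">4.2%</th></tr>
'''

# Pair each stat name (<td>) with its percentage cell (<th ... pct-data>).
stats = dict(re.findall(
    r'<td>([^<]+)</td>\s*<th class="[^"]*pct-data">([^<]+)</th>', html))
print(stats)  # {'Ace %': '23.4%', 'Double Fault %': '4.2%'}
```

Regexes over HTML are brittle in general; this only works here because the rows follow one fixed pattern.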

Using Python and Beautifulsoup how do I select the desired table in a div?

I would like to be able to select the table row containing the "Accounts Payable" text, but I'm not getting anywhere with what I'm trying; I'm pretty much guessing with findAll. Can someone show me how to do this?
For example this is what I start with:
<div>
<tr>
<td class="lft lm">Accounts Payable
</td>
<td class="r">222.82</td>
<td class="r">92.54</td>
<td class="r">100.34</td>
<td class="r rm">99.95</td>
</tr>
<tr>
<td class="lft lm">Accrued Expenses
</td>
<td class="r">36.49</td>
<td class="r">33.39</td>
<td class="r">31.39</td>
<td class="r rm">36.47</td>
</tr>
</div>
And this is what I would like to get as a result:
<tr>
<td class="lft lm">Accounts Payable
</td>
<td class="r">222.82</td>
<td class="r">92.54</td>
<td class="r">100.34</td>
<td class="r rm">99.95</td>
</tr>
You can select the td elements with class lft lm and then examine each element's string to determine whether you have the "Accounts Payable" td:
from BeautifulSoup import BeautifulSoup

# where so_soup.txt is your html
f = open("so_soup.txt", "r")
data = f.readlines()
f.close()

soup = BeautifulSoup("".join(data))
cells = soup.findAll('td', {"class": "lft lm"})
for cell in cells:
    # You can compare cell.string against "Accounts Payable"
    print(cell.string)
If you would like to examine the following siblings for Accounts Payable for instance, you could use the following:
if cell.string.strip() == "Accounts Payable":
    sibling = cell.findNextSibling()
    while sibling:
        print("\t" + sibling.string)
        sibling = sibling.findNextSibling()
Update for Edit
If you would like to print out the original HTML, just for the siblings that follow the Accounts Payable element, this is the code for that:
lines = ["<tr>"]
for cell in cells:
    lines.append(cell.prettify().decode('ascii'))
    if cell.string.strip() == "Accounts Payable":
        sibling = cell.findNextSibling()
        while sibling:
            lines.append(sibling.prettify().decode('ascii'))
            sibling = sibling.findNextSibling()
lines.append("</tr>")

f = open("so_soup_out.txt", "wt")
f.writelines(lines)
f.close()
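The answer above targets the long-retired BeautifulSoup 3 API. The same "find the row, then walk its cells" idea can be sketched with only the standard library against the question's HTML (the helper name row_for is just for illustration):

```python
from html.parser import HTMLParser

class RowFinder(HTMLParser):
    """Collects each <tr> as a list of its <td> cell texts."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self._cells = None
        self._buf = ''
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._cells = []
        elif tag == 'td':
            self._in_td = True
            self._buf = ''

    def handle_endtag(self, tag):
        if tag == 'td' and self._cells is not None:
            self._cells.append(self._buf.strip())
            self._in_td = False
        elif tag == 'tr':
            self.rows.append(self._cells)
            self._cells = None

    def handle_data(self, data):
        if self._in_td:
            self._buf += data

def row_for(html, label):
    """Return the sibling cell values of the row whose first cell matches label."""
    p = RowFinder()
    p.feed(html)
    for row in p.rows:
        if row and row[0] == label:
            return row[1:]

html = '''
<div>
<tr><td class="lft lm">Accounts Payable</td>
<td class="r">222.82</td><td class="r">92.54</td>
<td class="r">100.34</td><td class="r rm">99.95</td></tr>
<tr><td class="lft lm">Accrued Expenses</td>
<td class="r">36.49</td><td class="r">33.39</td>
<td class="r">31.39</td><td class="r rm">36.47</td></tr>
</div>
'''

print(row_for(html, 'Accounts Payable'))  # ['222.82', '92.54', '100.34', '99.95']
```

In modern BeautifulSoup 4 the equivalent would use soup.find_all('td', class_='lft lm') and find_next_siblings(), but the stdlib version above has no dependencies at all.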
