Use BeautifulSoup to get info from table with Python

Use BeautifulSoup to get info from table with Python - python

I installed BeautifulSoup, read the documentation and found some tutorials on getting info from a table, but only from basic tables with a couple rows and columns.
I'm having trouble understanding how to do something more complex.
I have an html doc and use
table = soup.findAll("table")
To find the table. What I get back is a bunch of listings looking like this:
<tbody class="item" data-buyout="8 exalted" data-ign="POOPOODOODOO" data-league="Standard"
data-name="Dusk Stone Cobalt Jewel" data-seller="LooseSausage" data-sellerid="3447078"
data-thread="1285903" id="item-container-0">
<tr class="first-line"> <td class="icon-td"> <div class="icon"><img alt="Item icon" src="http://webcdn.pathofexile.com/image/Art/2DItems/Jewels/basicint.png?scale=1&w=1&h=1&v=cd579ea22c05f1c6ad2fd015d7a710bd3">\n<div class="sockets" style="position: absolute;">
\n<div class="sockets-inner" style="position: relative; width:94px;">\n</div>\n</div></img></div> </td>
<td class="item-cell">
<h5><a class="title itemframe2" href="http://www.pathofexile.com/forum/view-thread/1285903" target="_blank"> Dusk Stone Cobalt Jewel </a></h5><p class="requirements"> </p> <span class="sockets-raw" style="display:none"></span> <ul class="item-mods"><li class="bullet-item"><ul class="mods">
<li class="sortable " data-name="##% increased Attack Speed" data-value="4.0" style=""><b>4</b>% increased Attack Speed</li>
<li class="sortable " data-name="#Minions have #% increased maximum Life" data-value="10.0" style="">Minions have <b>10</b>% increased maximum Life</li>
<li class="sortable " data-name="##% increased Ignite Duration on Enemies" data-value="5.0" style=""><b>5</b>% increased Ignite Duration on Enemies</li><li class="sortable active" data-name="#Minions deal #% increased Damage" data-value="16.0" style="">Minions deal <b>16</b>% increased Damage</li>
<li class="sortable " data-name="##% chance to Ignite" data-value="2.0" style=""><b>2</b>% chance to Ignite</li></ul></li></ul> </td> <td class="table-stats"> <table> <tr class="calibrate"> <th></th> <th></th> <th></th> <th></th> <th></th> <th></th> <th></th> <th></th> <th></th> <th></th> <th></th> <th></th> <th></th> <th></th> </tr> <tr class="cell-first"> <th class="disabled" colspan="2">Quality</th> <th class="disabled" colspan="2">Phys.</th> <th class="disabled" colspan="2">Elem.</th> <th class="disabled" colspan="2">APS</th> <th class="disabled" colspan="2">DPS</th> <th class="disabled" colspan="2">pDPS</th> <th class="disabled" colspan="2">eDPS</th> </tr> <tr class="cell-first"> <td class="sortable property " colspan="2" data-name="q" data-value="0"> 0<span class="capped">+20</span> </td> <td class="sortable property " colspan="2" data-name="quality_pd" data-value="0.0"> </td> <td class="sortable property " colspan="2" data-ed="" data-name="ed" data-value="0.0"> </td> <td class="sortable property " colspan="2" data-name="aps" data-value="0"> \xa0 </td> <td class="sortable property " colspan="2" data-name="quality_dps" data-value="0.0"> \xa0 </td> <td class="sortable property " colspan="2" data-name="quality_pdps" data-value="0.0"> \xa0 </td> <td class="sortable property " colspan="2" data-name="edps" data-value="0.0"> \xa0 </td> </tr> <tr class="cell-second"> <th class="cell-empty"></th> <th class="disabled" colspan="2">Armour</th> <th class="disabled" colspan="2">Evasion</th> <th class="disabled" colspan="2">Shield</th> <th class="disabled" colspan="2">Block</th> <th class="disabled" colspan="2">Crit.</th> <th class="disabled" colspan="2">Level</th> </tr> <tr class="cell-second"> <td class="cell-empty"></td> <td class="sortable property " colspan="2" data-name="quality_armour" data-value="0.0"> \xa0 </td> <td class="sortable property " colspan="2" data-name="quality_evasion" data-value="0.0"> \xa0 </td> <td class="sortable property " colspan="2" data-name="quality_shield" data-value="0.0"> \xa0 </td> <td class="sortable property " colspan="2" data-name="block" data-value="0"> \xa0 </td> <td class="sortable property " colspan="2" data-name="crit" data-value="0"> \xa0 </td> <td class="sortable property " colspan="2" data-name="level" data-value="0"> \xa0 </td> </tr> </table> </td> </tr> <tr class="bottom-row"> <td class="first-cell"></td> <td> <span class="requirements"> <span class="sortable" data-name="price_in_chaos" data-value="-212.0"><span class="has-tip currency currency-exalted" data-tooltip="" title="8 exalted">8\xd7</span></span> \xb7 <span class="click-button" data-hash="61053af5f4c3a82e330b3e38192b8480" data-thread="1285903" onclick="verify_modern(this)">Verify</span> \xb7 <span class="success label">online</span> IGN: POOPOODOODOO \xb7 <span class="has-tip" data-tooltip="" title="account age and highest level">a857h93</span> \xb7 PM \xb7 Whisper </span> </td> <td class="third-cell" colspan="16"></td> </tr> <tr><td class="item-separator" colspan="16"></td></tr>
</tbody>
What I would like to do is take the parts that say data-buyout, data-ign, data-name etc and save them in variables for use later. Then skip down to the li class="sortable" and get the data-name part or the text.
(Since they seem to be the same thing)
Im having trouble understanding how to do this all within the tbody class= "item" section.
I need to do this several times, since there are multiple items with the tbody class="item" in each table.
Any information I can get is greatly appreciated!

You may get data as follows:
from bs4 import BeautifulSoup
soup = BeautifulSoup(your_string_above)
>>>soup.tbody.attrs
{'data-sellerid': '3447078', 'data-buyout': '8 exalted', 'data-league': 'Standard', 'data-name': 'Dusk Stone Cobalt Jewel', 'data-thread': '1285903', 'id': 'item-container-0', 'data-ign': 'POOPOODOODOO', 'data-seller': 'LooseSausage', 'class': ['item']}
>>>[x.get_text(strip=True) for x in soup.select('li')]
['4% increased Attack SpeedMinions have10% increased maximum Life5% increased Ignite Duration on EnemiesMinions deal16% increased Damage2% chance to Ignite', '4% increased Attack Speed', 'Minions have10% increased maximum Life', '5% increased Ignite Duration on Enemies', 'Minions deal16% increased Damage', '2% chance to Ignite']

Related

How to extract the following lines after pattern match

the web source is like this:
<div class="MT12">
<table class="tblchart" border="0" cellspacing="0" cellpadding="0">
<tr>
<th rowspan="2" width="100" align="left" valign="top">Date</th>
<th rowspan="2" width="100" style="text-align:right;" valign="top">Open</th>
<th rowspan="2" width="100" style="text-align:right;" valign="top">High</th>
<th rowspan="2" width="100" style="text-align:right;" valign="top">Low</th>
<th rowspan="2" width="100" style="text-align:right;" valign="top">Close</th>
<th colspan="2" style="text-align:center;" valign="top">- SPREAD -</th>
</tr>
<tr>
<th width="100" style="text-align:right;" valign="top">(High-Low)</th>
<th width="100" style="text-align:right;" valign="top" class="last">(Open-Close)</th>
</tr>
<tr>
<td align="left" valign="top">2019-12-24</td>
<td valign="top" style="text-align:right;">12269.25</td>
<td valign="top" class="b_12vv" style="text-align:right">12283.70</td>
<td valign="top" style="text-align:right;">12202.10</td>
<td valign="top" style="text-align:right;">12214.55</td>
<td valign="top" style="text-align:right;">81.60</td>
<td align="right" valign="top" class="last" style="text-align:right;">54.70</td>
</tr>
<tr>
<td align="left" valign="top">2019-12-23</td>
<td valign="top" style="text-align:right;">12235.45</td>
<td valign="top" class="b_12vv" style="text-align:right">12287.15</td>
<td valign="top" style="text-align:right;">12213.25</td>
<td valign="top" style="text-align:right;">12262.75</td>
<td valign="top" style="text-align:right;">73.90</td>
<td align="right" valign="top" class="last" style="text-align:right;">-27.30</td>
</tr>
<tr>
<td align="left" valign="top">2019-12-20</td>
<td valign="top" style="text-align:right;">12266.45</td>
<td valign="top" class="b_12vv" style="text-align:right">12293.90</td>
<td valign="top" style="text-align:right;">12252.75</td>
<td valign="top" style="text-align:right;">12271.80</td>
<td valign="top" style="text-align:right;">41.15</td>
<td align="right" valign="top" class="last" style="text-align:right;">-5.35</td>
</tr>
</table>
</div>
I want to get the following numbers for every date:
say for example I have to get the numbers 12269.25, 12283.70, 12202.10 and 12214.55 for a particular date (2019-12-24). Then proceed for the next date given.
I am facing difficulty because I need to select next 4 lines(whose xpath is not exatly related much as shown above) following each date in the page. The dates can range from single date to 100-200 dates.
Can anybody please help with webdriver code snippet for the same.
Thanks a lot

Can this meet your needs
from simplified_scrapy.simplified_doc import SimplifiedDoc
html = '''<div class="MT12">
<table class="tblchart" border="0" cellspacing="0" cellpadding="0">
<tr>
<th rowspan="2" width="100" align="left" valign="top">Date</th>
<th rowspan="2" width="100" style="text-align:right;" valign="top">Open</th>
<th rowspan="2" width="100" style="text-align:right;" valign="top">High</th>
<th rowspan="2" width="100" style="text-align:right;" valign="top">Low</th>
<th rowspan="2" width="100" style="text-align:right;" valign="top">Close</th>
<th colspan="2" style="text-align:center;" valign="top">- SPREAD -</th>
</tr>
<tr>
<th width="100" style="text-align:right;" valign="top">(High-Low)</th>
<th width="100" style="text-align:right;" valign="top" class="last">(Open-Close)</th>
</tr>
<tr>
<td align="left" valign="top">2019-12-24</td>
<td valign="top" style="text-align:right;">12269.25</td>
<td valign="top" class="b_12vv" style="text-align:right">12283.70</td>
<td valign="top" style="text-align:right;">12202.10</td>
<td valign="top" style="text-align:right;">12214.55</td>
<td valign="top" style="text-align:right;">81.60</td>
<td align="right" valign="top" class="last" style="text-align:right;">54.70</td>
</tr>
<tr>
<td align="left" valign="top">2019-12-23</td>
<td valign="top" style="text-align:right;">12235.45</td>
<td valign="top" class="b_12vv" style="text-align:right">12287.15</td>
<td valign="top" style="text-align:right;">12213.25</td>
<td valign="top" style="text-align:right;">12262.75</td>
<td valign="top" style="text-align:right;">73.90</td>
<td align="right" valign="top" class="last" style="text-align:right;">-27.30</td>
</tr>
<tr>
<td align="left" valign="top">2019-12-20</td>
<td valign="top" style="text-align:right;">12266.45</td>
<td valign="top" class="b_12vv" style="text-align:right">12293.90</td>
<td valign="top" style="text-align:right;">12252.75</td>
<td valign="top" style="text-align:right;">12271.80</td>
<td valign="top" style="text-align:right;">41.15</td>
<td align="right" valign="top" class="last" style="text-align:right;">-5.35</td>
</tr>
</table>
</div>'''
doc = SimplifiedDoc(html)
table = doc.getElement(tag='table',value='tblchart')
trs = table.trs.notContains('<th') # get tr
for tr in trs:
tds = tr.tds # get all td
data = [td.text for td in tds]
print (data[0],data[1],data[2],data[3],data[4])

Get <td> text using python selenium

<html>
<body>
<table style="border:0">
<tbody>
<tr class="">
<td class="pr10">Mon</td>
<td class="pl10">11am – 11pm</td>
</tr>
<tr class="">
<td class="pr10">Tue</td>
<td class="pl10">11am – 11pm</td>
</tr>
<tr class="bold">
<td class="pr10">Wed</td>
<td class="pl10">11am – 11pm</td>
</tr>
<tr class="">
<td class="pr10">Thu</td>
<td class="pl10">11am – 11pm</td>
</tr>
<tr class="">
<td class="pr10">Fri</td>
<td class="pl10">11am – 11pm</td>
</tr>
<tr class="">
<td class="pr10">Sat</td>
<td class="pl10">11am – 11pm</td>
</tr>
<tr class="">
<td class="pr10">Sun</td>
<td class="pl10">11am – 11pm</td>
</tr>
</tbody>
</table>
</html>
</body>
Try 1:
driver.find_elements_by_xpath("//*[#class='pr10']")
Try 2:
driver.find_element_by_xpath("//tr[td='Mon']/td").text
But it not fetching the text "Mon" "11am - 11pm"
text_area = driver.find_elements_by_xpath("//*[#class='pr10']")
for items2 in text_area:
print(items2.text)

try this instead:
text_area = driver.find_elements_by_xpath("""//*[#id="body"]/table/tbody/tr[1]/td[1]""")
print([elm.get_attribute('innerHTML') for elm in text_area])

How to find a value in a table with no identifiers? (Python, Selenium)

I have a webpage with a table with many rows. A user will give me a number (15308) which can be found in the top line with the first <td> tag, and this is the only information I will have. I want to be able to use this number to find the data between the <th></th> tag (more specifically the 0), but only for the table row. For example, I attached two table rows and I want the <th> data using the number 15308, but not the <th> data from the table row that has the number 15309 in it's first <td>. Any help is appreciated!
Desired Output: 0
<tr>
<td>15308</td>
<td nowrap="">INFO 101 </td>
<td>A </td>
<td align="CENTER">LC</td>
<td>SOCIAL NETWORKING </td>
<td align="CENTER"> 150</td>
<td align="CENTER"> 150</td>
<td align="CENTER"> 250</td>
<th align="CENTER"> 0</th><td align="CENTER"> 229</td>
<td></td>
</tr>
<tr><td>15309</td>
<td nowrap="">INFO 101 </td>
<td>AA</td>
<td align="CENTER">LB</td>
<td>SOCIAL NETWORKING </td>
<td align="CENTER"> 25</td>
<td align="CENTER"> 25</td>
<td align="CENTER"> 26</td>
<th align="CENTER" style=""> 2</th><td align="CENTER"> 21</td>
<td></td>
</tr>

Use Following code :
userValue='15308'
all_td_th_of_row = driver.find_elements_by_xpath("//td[normalize-space()='" + userValue + "']//following-sibling::td|th")
i = 0
while i<len(all_td_th_of_row) :
print(all_td_th_of_row[i].text)
i=i+1

Something I have always found beautiful, using beauitfulsoup:
Using the xpath="1" as an attribute:
line = '''<tr><td>15308</td>
<td nowrap="">INFO 101 </td>
<td>A </td>
<td align="CENTER">LC</td>
<td>SOCIAL NETWORKING </td>
<td align="CENTER"> 150</td>
<td align="CENTER"> 150</td>
<td align="CENTER"> 250</td>
<th align="CENTER" style="" xpath="1"> 0</th><td align="CENTER"> 229</td>
<td></td>
</tr>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(line, 'html.parser')
xpathTh = soup.find('th', attrs={'xpath': '1'})
print(xpathTh.text.strip())
OUTPUT:
0
EDIT:
To get all the values from the attrib:
line = '''<tr><td>15308</td>
<td nowrap="">INFO 101 </td>
<td>A </td>
<td align="CENTER">LC</td>
<td>SOCIAL NETWORKING </td>
<td align="CENTER"> 150</td>
<td align="CENTER"> 150</td>
<td align="CENTER"> 250</td>
<th align="CENTER" style="" xpath="1"> 0</th><td align="CENTER"> 229</td>
<th align="CENTER" style="" xpath="1"> 1</th><td align="CENTER"> 229</td>
<th align="CENTER" style="" xpath="1"> 2</th><td align="CENTER"> 229</td>
<td></td>
</tr>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(line, 'html.parser')
xpathTh = soup.find_all('th', attrs={'xpath': '1'})
for elem in xpathTh:
print(elem.text.strip())
OUTPUT:
0
1
2
EDIT 2:
Considering you only want the xpath value if the anchor tag inside the td (inside tr) has a value of 15308:
line = '''<tr><td>15308</td>
<td nowrap="">INFO 101 </td>
<td>A </td>
<td align="CENTER">LC</td>
<td>SOCIAL NETWORKING </td>
<td align="CENTER"> 150</td>
<td align="CENTER"> 150</td>
<td align="CENTER"> 250</td>
<th align="CENTER" style="" xpath="1"> 0</th><td align="CENTER"> 229</td>
<td></td>
</tr>
<tr><td>22222</td>
<td nowrap="">INFO 101 </td>
<td>A </td>
<td align="CENTER">LC</td>
<td>SOCIAL NETWORKING </td>
<td align="CENTER"> 150</td>
<td align="CENTER"> 150</td>
<td align="CENTER"> 250</td>
<th align="CENTER" style="" xpath="1"> 1</th><td align="CENTER"> 229</td>
<td></td>
</tr>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(line, 'html.parser')
trElems = soup.find_all('tr')
toFind = '15308'
for tr in trElems:
val = tr.select('td a')[0].text
if toFind == val:
xpathTh = tr.find_all('th', attrs={'xpath': '1'})
for elem in xpathTh:
print(elem.text.strip())
OUTPUT:
0
EDIT 3:
Continuing from comments:
line = '''<tr>
<td>15308</td>
<td nowrap="">INFO 101 </td>
<td>A </td>
<td align="CENTER">LC</td>
<td>SOCIAL NETWORKING </td>
<td align="CENTER"> 150</td>
<td align="CENTER"> 150</td>
<td align="CENTER"> 250</td>
<th align="CENTER"> 0</th><td align="CENTER"> 229</td>
<td></td>
</tr>
<tr><td>15309</td>
<td nowrap="">INFO 101 </td>
<td>AA</td>
<td align="CENTER">LB</td>
<td>SOCIAL NETWORKING </td>
<td align="CENTER"> 25</td>
<td align="CENTER"> 25</td>
<td align="CENTER"> 26</td>
<th align="CENTER" style=""> 2</th><td align="CENTER"> 21</td>
<td></td>
</tr>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(line, 'html.parser')
trElems = soup.find_all('tr')
toFind = '15308'
for tr in trElems:
val = tr.select('td a')[0].text
if toFind == val:
xpathTh = tr.find_all('td')[7]
print("For the value: {}, The result is {}".format(toFind, xpathTh.find_next('th').text.strip()))
OUTPUT:
For the value: 15308, The result is 0

Table extraction: BeautifulSoup vs. Pandas.read_html

I have an html file taken from this link, but I am not being able to extract any sort of table neither with bs4.BeautifulSoup() nor with pandas.read_html. I understand that each row of my desired table starts with <tr class='odd'>. Despite that, something is not working when I pass soup.find({'class': 'odd'}) or pd.read_html(url, attrs = {'class': 'odd'}). Where is the mistake or what should I do instead?
The beginning of the table apparently starts in requests.get(url).content[8359:].
<table style="background-color:#FFFEEE; border-width:thin; border-collapse:collapse; border-spacing:0; border-style:outset;" rules="groups" >
<colgroup>
<colgroup>
<colgroup>
<colgroup>
<colgroup span="3">
<colgroup span="3">
<colgroup span="3">
<colgroup span="3">
<colgroup>
<tbody>
<tr style="vertical-align:middle; background-color:#177A9C">
<th scope="col" style="text-align:center">Ion</th>
<th scope="col" style="text-align:center"> Observed <br /> Wavelength <br /> Vac (nm) </th>
<th scope="col" style="text-align:center; white-space:nowrap"> <i>g<sub>k</sub>A<sub>ki</sub></i><br /> (10<sup>8</sup> s<sup>-1</sup>) </th>
<th scope="col"> Acc. </th>
<th scope="col" style="text-align:center; white-space:nowrap"> <i>E<sub>i</sub></i> <br /> (eV) </th>
<th> </th>
<th scope="col" style="text-align:center; white-space:nowrap"> <i>E<sub>k</sub></i> <br /> (eV) </th>
<th scope="col" style="text-align:center" colspan="3"> Lower Level <br /> Conf., Term, J </th>
<th scope="col" style="text-align:center" colspan="3"> Upper Level <br /> Conf., Term, J </th>
<th scope="col" style="text-align:center"> <i>g<sub>i</sub></i> </th>
<th scope="col" style="text-align:center"> <b>-</b> </th>
<th scope="col" style="text-align:center"> <i>g<sub>k</sub></i> </th>
<th scope="col" style="text-align:center"> Type </th>
</tr>
</tbody>
<tbody>
<tr>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
</tr>
<tr class='odd'>
<td class="lft1"><b>C I</b> </td>
<td class="fix"> 193.090540 </td>
<td class="lft1">1.02e+01 </td>
<td class="lft1"> A</td>
<td class="fix">1.2637284 </td>
<td class="dsh">- </td>
<td class="fix">7.68476771 </td>
<td class="lft1"> 2<i>s</i><sup>2</sup>2<i>p</i><sup>2</sup> </td>
<td class="lft1"> <sup>1</sup>D </td>
<td class="lft1"> 2 </td>
<td class="lft1"> 2<i>s</i><sup>2</sup>2<i>p</i>3<i>s</i> </td>
<td class="lft1"> <sup>1</sup>P° </td>
<td class="lft1"> 1 </td>
<td class="rgt"> 5</td>
<td class="dsh">-</td>
<td class="lft1">3 </td>
<td class="cnt"><sup></sup><sub></sub></td>
</tr>

This code can give you a jump start on this project, however, if you're looking for someone to build the whole project, request data, scrape, store, manipulate I would suggest hiring someone or learning how to do it. HERE is the BeautifulSoup Documentation.
Go through (the quickstart guide) it once and you'll pretty much be know all there is on bs4.
import requests
from bs4 import BeautifulSoup
from time import sleep
url = 'https://physics.nist.gov/'
second_part = 'cgi-bin/ASD/lines1.pl?spectra=C%20I%2C%20Ti%20I&limits_type=0&low_w=190&upp_w=250&unit=1&de=0&format=0&line_out=0&no_spaces=on&remove_js=on&en_unit=1&output=0&bibrefs=0&page_size=15&show_obs_wl=1&unc_out=0&order_out=0&max_low_enrg=&show_av=2&max_upp_enrg=&tsb_value=0&min_str=&A_out=1&A8=1&max_str=&allowed_out=1&forbid_out=1&min_accur=&min_intens=&conf_out=on&term_out=on&enrg_out=on&J_out=on&g_out=on&submit=Retrieve%20Data%27'
page = requests.get(url+second_part)
soup = BeautifulSoup(page.content, "lxml")
whole_table = soup.find('table', rules='groups')
sub_tbody = whole_table.find_all('tbody')
# the two above lines are used to locate the table and the content
# we then continue to iterate through sub-categories i.e. tbody-s > tr-s > td-s
for tag in sub_tbody:
if tag.find('tr').find('td'):
table_rows = tag.find_all('tr')
for tag2 in table_rows:
if tag2.has_attr('class'):
td_tags = tag2.find_all('td')
print(td_tags[0].text, '<- Is the ion')
print(td_tags[1].text, '<- Wavelength')
print(td_tags[2].text, '<- Some formula gk Aki')
# and so on...
print('--'*40) # unecessary but does print ----------...
else:
pass

You need to search for the tags and then the class. So using the lxml parser;
soup = BeautifulSoup(yourdata, 'lxml')
for i in soup.find_all('tr',attrs={'class':"odd"}):
print(i.text)
From this point you can write this data directly to a file or generate an array (list of lists - your rows) then put into pandas etc etc.

beautiful soup get children that are Tags (not Navigable Strings) from a Tag

Beautiful soup documentation provides attributes .contents and .children to access the children of a given tag (a list and an iterable respectively), and includes both Navigable Strings and Tags. I want only the children of type Tag.
I'm currently accomplishing this using list comprehension:
rows=[x for x in table.tbody.children if type(x)==bs4.element.Tag]
but I'm wondering if there is a better/more pythonic/built-in way to get just Tag children.

thanks to J.F.Sebastian , the following will work:
rows=table.tbody.find_all(True, recursive=False)
Documentation here: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#true
In my case, I needed actual rows in the table, so I ended up using the following, which is more precise and I think more readable:
rows=table.tbody.find_all('tr')
Again, docs: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#navigating-using-tag-names
I believe this is a better way than iterating through all the children of a Tag.
Worked with the following input:
<table cellspacing="0" cellpadding="0">
<thead>
<tr class="title-row">
<th class="title" colspan="100">
<div style="position:relative;">
President
<span class="pct-rpt">
99% reporting
</span>
</div>
</th>
</tr>
<tr class="header-row">
<th class="photo first">
</th>
<th class="candidate ">
Candidate
</th>
<th class="party ">
Party
</th>
<th class="votes ">
Votes
</th>
<th class="pct ">
Pct.
</th>
<th class="change ">
Change from ‘08
</th>
<th class="evotes last">
Electoral Votes
</th>
</tr>
</thead>
<tbody>
<tr class="">
<td class="photo first">
<div class="photo_wrap"><img alt="P-barack-obama" height="48" src="http://i1.nyt.com/projects/assets/election_2012/images/candidate_photos/election_night/p-barack-obama.jpg?1352320690" width="68" /></div>
</td>
<td class="candidate ">
<div class="winner dem"><img alt="Hp-checkmark#2x" height="9" src="http://i1.nyt.com/projects/assets/election_2012/images/swatches/hp-checkmark#2x.png?1352320690" width="10" />Barack Obama</div>
</td>
<td class="party ">
Dem.
</td>
<td class="votes ">
2,916,811
</td>
<td class="pct ">
57.3%
</td>
<td class="change ">
-4.6%
</td>
<td class="evotes last">
20
</td>
</tr>
<tr class="">
<td class="photo first">
</td>
<td class="candidate ">
<div class="not-winner">Mitt Romney</div>
</td>
<td class="party ">
Rep.
</td>
<td class="votes ">
2,090,116
</td>
<td class="pct ">
41.1%
</td>
<td class="change ">
+4.3%
</td>
<td class="evotes last">
0
</td>
</tr>
<tr class="">
<td class="photo first">
</td>
<td class="candidate ">
<div class="not-winner">Gary Johnson</div>
</td>
<td class="party ">
Lib.
</td>
<td class="votes ">
54,798
</td>
<td class="pct ">
1.1%
</td>
<td class="change ">
–
</td>
<td class="evotes last">
0
</td>
</tr>
<tr class="last-row">
<td class="photo first">
</td>
<td class="candidate ">
div class="not-winner">Jill Stein</div>
</td>
<td class="party ">
Green
</td>
<td class="votes ">
29,336
</td>
<td class="pct ">
0.6%
</td>
<td class="change ">
–
</td>
<td class="evotes last">
0
</td>
</tr>
<tr>
<td class="footer" colspan="100">
President Map |
President Big Board |
Exit Polls
</td>
</tr>
</tbody>
</table>

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Use BeautifulSoup to get info from table with Python - python

Related

How to extract the following lines after pattern match

Get <td> text using python selenium

How to find a value in a table with no identifiers? (Python, Selenium)

Table extraction: BeautifulSoup vs. Pandas.read_html

beautiful soup get children that are Tags (not Navigable Strings) from a Tag

Categories

Resources