Get only the values (not the labels) from two different tables? - python

I want to select data from two different tables that share the same class. I tried 'soup.find_all', but formatting the resulting data is getting difficult.
There are two tables with the same class, and I need only the values (not the labels) from them.
TABLE 1:
<div class="bh_collapsible-body" style="display: none;">
<table border="0" cellpadding="2" cellspacing="2" class="prop-list">
<tbody>
<tr>
<td class="item">
<table>
<tbody>
<tr>
<td class="label">Rim Material</td>
<td class="value">Alloy</td>
</tr>
</tbody>
</table>
</td>
<td class="item">
<table>
<tbody>
<tr>
<td class="label">Front Tyre Description</td>
<td class="value">215/55 R16</td>
</tr>
</tbody>
</table>
</td>
</tr>
<tr>
<td class="item">
<table>
<tbody>
<tr>
<td class="label">Front Rim Description</td>
<td class="value">16x7.0</td>
</tr>
</tbody>
</table>
</td>
<td class="item">
<table>
<tbody>
<tr>
<td class="label">Rear Tyre Description</td>
<td class="value">215/55 R16</td>
</tr>
</tbody>
</table>
</td>
</tr>
<tr>
<td class="item">
<table>
<tbody>
<tr>
<td class="label">Rear Rim Description</td>
<td class="value">16x7.0</td>
</tr>
</tbody>
</table>
</td>
<td></td>
</tr>
</tbody>
</table>
</div>
</div>
TABLE 2:
<div class="bh_collapsible-body" style="display: none;">
<table border="0" cellpadding="2" cellspacing="2" class="prop-list">
<tbody>
<tr>
<td class="item">
<table>
<tbody>
<tr>
<td class="label">Steering</td>
<td class="value">Rack and Pinion</td>
</tr>
</tbody>
</table>
</td>
<td></td>
</tr>
</tbody>
</table>
</div>
</div>
What I have tried:
I tried getting the first table's contents with XPath, but it returns both values and labels.
table1 = driver.find_element_by_xpath("//*[@id='features']/div/div[5]/div[2]/div[1]/div[1]/div/div[2]/table/tbody/tr[1]/td[1]/table/tbody/tr/td[2]")
I tried to split the data but did not succeed.

I think you are looking for the CSS selector tr:not(:has(tr)), which selects only the innermost <tr> elements:
from bs4 import BeautifulSoup

soup = BeautifulSoup(data, 'html.parser')  # data holds the HTML string for Table 1 and Table 2 from the question
rows = []
for tr in soup.select('tr:not(:has(tr))'):
    rows.append([td.get_text(strip=True) for td in tr.select('td')])
for row in zip(*rows):
    print(''.join('{: ^25}'.format(d) for d in row))
Prints:
Rim Material Front Tyre Description Front Rim Description Rear Tyre Description Rear Rim Description Steering
Alloy 215/55 R16 16x7.0 215/55 R16 16x7.0 Rack and Pinion
The variable rows contains:
[['Rim Material', 'Alloy'],
['Front Tyre Description', '215/55 R16'],
['Front Rim Description', '16x7.0'],
['Rear Tyre Description', '215/55 R16'],
['Rear Rim Description', '16x7.0'],
['Steering', 'Rack and Pinion']]
Further reading:
CSS Selectors Reference
EDIT: Changed the CSS selector to tr:not(:has(tr))
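If only the value cells are needed, as the question asks, selecting td.value inside each outer table may be simpler than filtering rows. This is a minimal sketch, assuming the markup keeps the class names shown in the question; the data string below is a trimmed stand-in for the two tables, not the real page.

```python
from bs4 import BeautifulSoup

# Trimmed stand-in for the two tables in the question (assumption for illustration).
data = """
<table class="prop-list"><tbody><tr>
  <td class="item"><table><tbody><tr>
    <td class="label">Rim Material</td><td class="value">Alloy</td>
  </tr></tbody></table></td>
  <td class="item"><table><tbody><tr>
    <td class="label">Front Tyre Description</td><td class="value">215/55 R16</td>
  </tr></tbody></table></td>
</tr></tbody></table>
<table class="prop-list"><tbody><tr>
  <td class="item"><table><tbody><tr>
    <td class="label">Steering</td><td class="value">Rack and Pinion</td>
  </tr></tbody></table></td>
  <td></td>
</tr></tbody></table>
"""

soup = BeautifulSoup(data, "html.parser")

# One list of values per outer table; label cells are never selected.
values_per_table = [
    [td.get_text(strip=True) for td in table.select("td.value")]
    for table in soup.select("table.prop-list")
]
print(values_per_table)
```

Only the outer tables carry the prop-list class, so each inner list corresponds to one of the two tables.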

Related

How to use BeautifulSoup in this case without a class or id

How can I get the text 'Wow, you get it!'? I can print the date, but I can't get the td that comes right after the date.
<table border="0" cellpadding="4" cellspacing="1" width="100%">
<tr bgcolor="#505050">
<td class="white" colspan="2">
<b>
Account Here
</b>
</td>
</tr>
<tr bgcolor="#F1E0C6">
<td colspan="2">
There is nothing
</td>
</tr>
</table>
<br/>
<br/>
<table border="0" cellpadding="4" cellspacing="1" width="100%">
<tr bgcolor="#505050">
<td class="white" colspan="2">
<b>
Death
</b>
</td>
</tr>
<tr bgcolor="#F1E0C6">
<td valign="top" width="25%">
Aug 15 2021, 18:36:22 CEST
</td>
<td>
Wow, you get it!
</td>
</tr>
<tr bgcolor="#D4C0A1">
<td valign="top" width="25%">
Aug 01 2021, 21:25:39 CEST
</td>
<td>
Next Time
</td>
</tr>
</table>
I got the date with this code:
print(soup.find_all('td', {'valign': 'top'})[0].get_text())
which shows:
Aug 15 2021, 18:36:22 CEST
but I can't find a way to get the td that follows the date.
If html_doc contains the HTML snippet from the question:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, "html.parser")
txt = soup.select_one('td[valign="top"] + td').get_text(strip=True)
print(txt)
Prints:
Wow, you get it!
Or:
txt = soup.find("td", {"valign": "top"}).find_next("td").get_text(strip=True)
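To collect every date together with the text that follows it, the same idea can be applied to all matching cells. A minimal sketch, assuming each date cell has valign="top" and is immediately followed by its message cell; the html_doc below is a shortened copy of the question's second table.

```python
from bs4 import BeautifulSoup

# Shortened copy of the "Death" table from the question (assumption for illustration).
html_doc = """
<table>
<tr><td class="white" colspan="2"><b>Death</b></td></tr>
<tr><td valign="top" width="25%">Aug 15 2021, 18:36:22 CEST</td><td>Wow, you get it!</td></tr>
<tr><td valign="top" width="25%">Aug 01 2021, 21:25:39 CEST</td><td>Next Time</td></tr>
</table>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Pair each date cell with the <td> that follows it in the same row.
entries = [
    (date_td.get_text(strip=True),
     date_td.find_next_sibling("td").get_text(strip=True))
    for date_td in soup.select('td[valign="top"]')
]
for date, message in entries:
    print(date, "->", message)
```

find_next_sibling stays within the same row, whereas find_next would continue into later rows if a date cell had no sibling.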

Extracting multiple table data using python and beautiful soup

<div class="row margin_30">
<div class="col-md-12 col-sm-12 col-xs-12 col-lg-12">
<div class="table-responsive table-border-radius">
<table class="table table-hover result-table-new1 " style="margin:0">
<thead class="">
<tr class="">
<th style="text-align:center;">Pl</th>
<th>H.No</th>
<th>Horse/Pedigree</th>
<th>Desc</th>
<th>Trainer</th>
<th>Jockey</th>
<th>Wt</th>
<th>Al</th>
<th>Dr</th>
<th>Sh</th>
<th>Won By</th>
<th>Dist Win</th>
<th>Rtg</th>
<th>Odds</th>
<th>Time</th>
</tr>
</thead>
<tbody class="">
<tr class="dividend_tr" >
<td>1 </td>
<td style="text-align: center;">7 </td>
<td class="race_card_td"><h5 style="font-size:16px">
<a href="http://www.indiarace.com/Home/horseStatistics/55234/SILKEN
STRIKER">
SILKEN STRIKER </a></h5>
<h6 class="margin_remove">Sussex(GB)-Flying Rani </h6>
</td>
<td>
4y b g </td>
<td>
Irfan Ghatala </td>
<td>
Anjar Alam </td>
<td>
56 </td>
<td>
- </td>
<td>
6 </td>
<td>
A </td>
<td>
5 1/2 </td>
<td>
</td>
<td>
12 </td>
<td>
</td>
<td>
1:14.57 </td>
</tr>
<tr class="dividend_tr" >
<td>
2 </td>
<td style="text-align: center;">
5 </td>
<td class="race_card_td">
<h5 style="font-size:16px">
<a href="http://www.indiarace.com/Home/horseStatistics/55737/ULTIMATE
POWER">
ULTIMATE POWER </a>
</h5>
<h6 class="margin_remove">
Epicentre(USA)-Methodical </h6>
</td>
<td>
4y b g </td>
<td>
V Lokanath </td>
<td>
Darshan R N </td>
<td>
57 </td>
<td>
-1 </td>
<td>
3 </td>
<td>
A </td>
<td>
5 </td>
<td>
5.5 </td>
<td>
14 </td>
<td>
</td>
<td>
1:15.47 </td>
</tr>
</tbody>
</table>
</div>
I want the following output using Beautiful Soup, and I want to store it in a CSV file. The actual page [http://www.indiarace.com/Home/racingCenterEvent?venueId=3&event_date=2018-08-10&race_type=RESULTS] has multiple tables and many rows. I also need to write a function to get data from different pages.
[Result][1]
[1]: https://i.stack.imgur.com/4LYt8.jpg
Any help would be greatly appreciated.
It's pretty simple: find all the tables, then iterate over their tr and td elements as required. You can use pandas to save the scraped data. I have parsed the tables for you (the rest is up to you); see the code below.
import requests
from bs4 import BeautifulSoup

url = 'http://www.indiarace.com/Home/racingCenterEvent?venueId=3&event_date=2018-08-10&race_type=RESULTS'
html = requests.get(url)
soup = BeautifulSoup(html.content, 'html.parser')
tables = soup.find_all('table', attrs={'class': 'result-table-new1'})
for table in tables:
    # print the text of each row on a single line
    for tr in table.find_all('tr'):
        print(tr.text.replace('\n', ' '))
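For the CSV part of the question, the standard csv module is one option besides pandas. The following is a rough sketch, not the answerer's code: table_rows is a hypothetical helper name, the HTML is a cut-down version of the question's table, and the output goes to an in-memory buffer so the example runs without network or file access.

```python
import csv
import io

from bs4 import BeautifulSoup

# Cut-down version of the results table from the question (assumption for illustration).
html = """
<table class="table result-table-new1">
<thead><tr><th>Pl</th><th>H.No</th><th>Horse/Pedigree</th></tr></thead>
<tbody>
<tr class="dividend_tr"><td>1 </td><td>7 </td>
<td class="race_card_td"><h5><a href="#">SILKEN STRIKER </a></h5><h6>Sussex(GB)-Flying Rani </h6></td></tr>
<tr class="dividend_tr"><td>2 </td><td>5 </td>
<td class="race_card_td"><h5><a href="#">ULTIMATE POWER </a></h5><h6>Epicentre(USA)-Methodical </h6></td></tr>
</tbody>
</table>
"""


def table_rows(table):
    """Return one list of cell strings per <tr>, header row included."""
    return [
        [cell.get_text(" ", strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in table.find_all("tr")
    ]


soup = BeautifulSoup(html, "html.parser")
buffer = io.StringIO()
writer = csv.writer(buffer)
for table in soup.find_all("table", class_="result-table-new1"):
    writer.writerows(table_rows(table))

csv_text = buffer.getvalue()
print(csv_text)
```

To write a real file, the buffer could be replaced with something like open("results.csv", "w", newline="", encoding="utf-8"), where the file name is only an example. Wrapping the request and parsing steps in a function that takes the URL would cover the "different pages" requirement.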

Table extraction: BeautifulSoup vs. Pandas.read_html

I have an HTML file taken from this link, but I am unable to extract any table from it with either bs4.BeautifulSoup() or pandas.read_html. I understand that each row of my desired table starts with <tr class='odd'>. Despite that, nothing works when I call soup.find({'class': 'odd'}) or pd.read_html(url, attrs = {'class': 'odd'}). Where is the mistake, or what should I do instead?
The table apparently begins at requests.get(url).content[8359:].
<table style="background-color:#FFFEEE; border-width:thin; border-collapse:collapse; border-spacing:0; border-style:outset;" rules="groups" >
<colgroup>
<colgroup>
<colgroup>
<colgroup>
<colgroup span="3">
<colgroup span="3">
<colgroup span="3">
<colgroup span="3">
<colgroup>
<tbody>
<tr style="vertical-align:middle; background-color:#177A9C">
<th scope="col" style="text-align:center">Ion</th>
<th scope="col" style="text-align:center"> Observed <br /> Wavelength <br /> Vac (nm) </th>
<th scope="col" style="text-align:center; white-space:nowrap"> <i>g<sub>k</sub>A<sub>ki</sub></i><br /> (10<sup>8</sup> s<sup>-1</sup>) </th>
<th scope="col"> Acc. </th>
<th scope="col" style="text-align:center; white-space:nowrap"> <i>E<sub>i</sub></i> <br /> (eV) </th>
<th> </th>
<th scope="col" style="text-align:center; white-space:nowrap"> <i>E<sub>k</sub></i> <br /> (eV) </th>
<th scope="col" style="text-align:center" colspan="3"> Lower Level <br /> Conf., Term, J </th>
<th scope="col" style="text-align:center" colspan="3"> Upper Level <br /> Conf., Term, J </th>
<th scope="col" style="text-align:center"> <i>g<sub>i</sub></i> </th>
<th scope="col" style="text-align:center"> <b>-</b> </th>
<th scope="col" style="text-align:center"> <i>g<sub>k</sub></i> </th>
<th scope="col" style="text-align:center"> Type </th>
</tr>
</tbody>
<tbody>
<tr>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
<td> </td>
</tr>
<tr class='odd'>
<td class="lft1"><b>C I</b> </td>
<td class="fix"> 193.090540 </td>
<td class="lft1">1.02e+01 </td>
<td class="lft1"> A</td>
<td class="fix">1.2637284 </td>
<td class="dsh">- </td>
<td class="fix">7.68476771 </td>
<td class="lft1"> 2<i>s</i><sup>2</sup>2<i>p</i><sup>2</sup> </td>
<td class="lft1"> <sup>1</sup>D </td>
<td class="lft1"> 2 </td>
<td class="lft1"> 2<i>s</i><sup>2</sup>2<i>p</i>3<i>s</i> </td>
<td class="lft1"> <sup>1</sup>P° </td>
<td class="lft1"> 1 </td>
<td class="rgt"> 5</td>
<td class="dsh">-</td>
<td class="lft1">3 </td>
<td class="cnt"><sup></sup><sub></sub></td>
</tr>
This code can give you a jump start on this project. However, if you're looking for someone to build the whole project (requesting data, scraping, storing, manipulating), I would suggest hiring someone or learning how to do it. HERE is the BeautifulSoup documentation.
Go through the quickstart guide once and you'll know pretty much all there is to know about bs4.
import requests
from bs4 import BeautifulSoup
from time import sleep
url = 'https://physics.nist.gov/'
second_part = 'cgi-bin/ASD/lines1.pl?spectra=C%20I%2C%20Ti%20I&limits_type=0&low_w=190&upp_w=250&unit=1&de=0&format=0&line_out=0&no_spaces=on&remove_js=on&en_unit=1&output=0&bibrefs=0&page_size=15&show_obs_wl=1&unc_out=0&order_out=0&max_low_enrg=&show_av=2&max_upp_enrg=&tsb_value=0&min_str=&A_out=1&A8=1&max_str=&allowed_out=1&forbid_out=1&min_accur=&min_intens=&conf_out=on&term_out=on&enrg_out=on&J_out=on&g_out=on&submit=Retrieve%20Data%27'
page = requests.get(url+second_part)
soup = BeautifulSoup(page.content, "lxml")
whole_table = soup.find('table', rules='groups')
sub_tbody = whole_table.find_all('tbody')
# the two above lines are used to locate the table and the content
# we then continue to iterate through sub-categories i.e. tbody-s > tr-s > td-s
for tag in sub_tbody:
    # only process tbody blocks whose first row holds <td> cells (the header tbody has <th> only)
    if tag.find('tr').find('td'):
        table_rows = tag.find_all('tr')
        for tag2 in table_rows:
            # data rows carry a class attribute (e.g. class='odd'); the blank spacer row does not
            if tag2.has_attr('class'):
                td_tags = tag2.find_all('td')
                print(td_tags[0].text, '<- Is the ion')
                print(td_tags[1].text, '<- Wavelength')
                print(td_tags[2].text, '<- Some formula gk Aki')
                # and so on...
                print('--'*40)  # unnecessary, but prints a separator line
You need to search for the tag first and then filter by class. So, using the lxml parser:
soup = BeautifulSoup(yourdata, 'lxml')
for i in soup.find_all('tr', attrs={'class': "odd"}):
    print(i.text)
From this point you can write the data directly to a file, or build a list of lists (one list per row) and load it into pandas, and so on.
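As for why the original attempts returned nothing: the first positional argument of find() is the tag-name filter, so a dict passed there is not treated as attrs, and as far as I know the attrs argument of pandas.read_html is matched against the <table> element, not its rows. A minimal sketch of the list-of-lists approach follows, assuming rows are marked with class='odd' as in the question; the HTML is a two-column cut-down of the real table.

```python
from bs4 import BeautifulSoup

# Two-column cut-down of the table from the question (assumption for illustration).
html = """
<table rules="groups">
<tbody><tr><th>Ion</th><th>Observed Wavelength Vac (nm)</th></tr></tbody>
<tbody>
<tr><td> </td><td> </td></tr>
<tr class='odd'><td class="lft1"><b>C I</b> </td><td class="fix"> 193.090540 </td></tr>
</tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Name the tag first, then filter by class; each row becomes a list of cell strings.
rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in soup.find_all("tr", class_="odd")
]
print(rows)
```

The resulting list of lists can be passed to csv.writer or to a pandas DataFrame constructor with column names of your choosing.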

How to get text content of multiple <td> tags inside a table using PyQuery?

How can I get the text of each attribute and its value from the book-details table below, where the values are stored as the text of td elements?
<table cellspacing="0" class="fk-specs-type2">
<tr>
<th class="group-head" colspan="2">Book Details</th>
</tr>
<tr>
<td class="specs-key">Publisher</td>
<td class="specs-value fk-data">HARPER COLLINS INDIA</td>
</tr>
<tr>
<td class="specs-key">ISBN-13</td>
<td class="specs-value fk-data">9789350291924</td>
</tr>
</table>
You can use the following code snippet to get the Publisher and ISBN-13 data:
from pyquery import PyQuery
html = """<table cellspacing="0" class="fk-specs-type2">
<tr>
<th class="group-head" colspan="2">Book Details</th>
</tr>
<tr>
<td class="specs-key">Publisher</td>
<td class="specs-value fk-data">HARPER COLLINS INDIA</td>
</tr>
<tr>
<td class="specs-key">ISBN-13</td>
<td class="specs-value fk-data">9789350291924</td>
</tr>
</table>
"""
doc = PyQuery(html)
for td in doc("table.fk-specs-type2").find("td.specs-key"):
    # td is the key cell; getnext() returns the sibling cell holding the value
    print(td.text, td.getnext().text)
It should print the following two lines:
Publisher HARPER COLLINS INDIA
ISBN-13 9789350291924

beautiful soup get children that are Tags (not Navigable Strings) from a Tag

The Beautiful Soup documentation provides the attributes .contents and .children to access the children of a given tag (a list and an iterable, respectively); both include Navigable Strings as well as Tags. I want only the children of type Tag.
I'm currently accomplishing this with a list comprehension:
rows=[x for x in table.tbody.children if type(x)==bs4.element.Tag]
but I'm wondering if there is a better/more pythonic/built-in way to get just Tag children.
Thanks to J.F. Sebastian, the following will work:
rows=table.tbody.find_all(True, recursive=False)
Documentation here: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#true
In my case, I needed actual rows in the table, so I ended up using the following, which is more precise and I think more readable:
rows=table.tbody.find_all('tr')
Again, docs: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#navigating-using-tag-names
I believe this is a better way than iterating through all the children of a Tag.
Worked with the following input:
<table cellspacing="0" cellpadding="0">
<thead>
<tr class="title-row">
<th class="title" colspan="100">
<div style="position:relative;">
President
<span class="pct-rpt">
99% reporting
</span>
</div>
</th>
</tr>
<tr class="header-row">
<th class="photo first">
</th>
<th class="candidate ">
Candidate
</th>
<th class="party ">
Party
</th>
<th class="votes ">
Votes
</th>
<th class="pct ">
Pct.
</th>
<th class="change ">
Change from ‘08
</th>
<th class="evotes last">
Electoral Votes
</th>
</tr>
</thead>
<tbody>
<tr class="">
<td class="photo first">
<div class="photo_wrap"><img alt="P-barack-obama" height="48" src="http://i1.nyt.com/projects/assets/election_2012/images/candidate_photos/election_night/p-barack-obama.jpg?1352320690" width="68" /></div>
</td>
<td class="candidate ">
<div class="winner dem"><img alt="Hp-checkmark@2x" height="9" src="http://i1.nyt.com/projects/assets/election_2012/images/swatches/hp-checkmark@2x.png?1352320690" width="10" />Barack Obama</div>
</td>
<td class="party ">
Dem.
</td>
<td class="votes ">
2,916,811
</td>
<td class="pct ">
57.3%
</td>
<td class="change ">
-4.6%
</td>
<td class="evotes last">
20
</td>
</tr>
<tr class="">
<td class="photo first">
</td>
<td class="candidate ">
<div class="not-winner">Mitt Romney</div>
</td>
<td class="party ">
Rep.
</td>
<td class="votes ">
2,090,116
</td>
<td class="pct ">
41.1%
</td>
<td class="change ">
+4.3%
</td>
<td class="evotes last">
0
</td>
</tr>
<tr class="">
<td class="photo first">
</td>
<td class="candidate ">
<div class="not-winner">Gary Johnson</div>
</td>
<td class="party ">
Lib.
</td>
<td class="votes ">
54,798
</td>
<td class="pct ">
1.1%
</td>
<td class="change ">
–
</td>
<td class="evotes last">
0
</td>
</tr>
<tr class="last-row">
<td class="photo first">
</td>
<td class="candidate ">
<div class="not-winner">Jill Stein</div>
</td>
<td class="party ">
Green
</td>
<td class="votes ">
29,336
</td>
<td class="pct ">
0.6%
</td>
<td class="change ">
–
</td>
<td class="evotes last">
0
</td>
</tr>
<tr>
<td class="footer" colspan="100">
President Map |
President Big Board |
Exit Polls
</td>
</tr>
</tbody>
</table>
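The difference between the approaches discussed above can be seen on a small example. This is a sketch with made-up markup (a table with one nested table), not the election table itself; it shows that find_all('tr') is recursive by default and so also returns rows of nested tables, while recursive=False restricts the search to direct children.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Made-up markup with a nested table (assumption for illustration).
html = """
<table><tbody>
<tr class="outer"><td>
  <table><tbody><tr class="inner"><td>nested</td></tr></tbody></table>
</td></tr>
<tr class="outer"><td>plain</td></tr>
</tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")
tbody = soup.table.tbody

children = list(tbody.children)                       # Tags and whitespace strings
tags_only = tbody.find_all(True, recursive=False)     # direct Tag children only
all_rows = tbody.find_all("tr")                       # every <tr> at any depth
direct_rows = tbody.find_all("tr", recursive=False)   # only the tbody's own rows

print(len(children), len(tags_only), len(all_rows), len(direct_rows))
```

For a flat table like the one above, find_all('tr') and find_all(True, recursive=False) give the same rows; the distinction only matters when tables are nested.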
