How to extract HTML table following a specific heading? - python

I am using BeautifulSoup to parse HTML files. I have a HTML file similar to this:
<h3>Unimportant heading</h3>
<table class="foo">
<tr>
<td>Key A</td>
</tr>
<tr>
<td>A value I don't want</td>
</tr>
</table>
<h3>Unimportant heading</h3>
<table class="foo">
<tr>
<td>Key B</td>
</tr>
<tr>
<td>A value I don't want</td>
</tr>
</table>
<h3>THE GOOD STUFF</h3>
<table class="foo">
<tr>
<td>Key C</td>
</tr>
<tr>
<td>I WANT THIS STRING</td>
</tr>
</table>
<h3>Unimportant heading</h3>
<table class="foo">
<tr>
<td>Key A</td>
</tr>
<tr>
<td>A value I don't want</td>
</tr>
</table>
I want to extract the string "I WANT THIS STRING". The perfect solution would be to get the first table following the h3 heading called "THE GOOD STUFF". I have no idea how to do this with BeautifulSoup - I only know how to extract a table with a specific class, or a table nested within some particular tag, but not following a particular tag.
I think a fallback solution could make use of the string "Key C", assuming it's unique (it almost certainly is) and appears in only that one table, but I'd feel better with going for the specific h3 heading.

Following the logic of @Zroq's answer on another question, this code gives you the table that follows your chosen header ("THE GOOD STUFF"). Note that I put all of your HTML in a variable called "html".
import re
from bs4 import BeautifulSoup, Tag

soup = BeautifulSoup(html, "lxml")
for header in soup.find_all('h3', string=re.compile('THE GOOD STUFF')):
    nextNode = header
    while True:
        # Walk forward through the siblings of the matching header
        nextNode = nextNode.next_sibling
        if nextNode is None:
            break
        if isinstance(nextNode, Tag):
            # Stop as soon as the next heading starts
            if nextNode.name == "h3":
                break
            print(nextNode)
Output:
<table class="foo">
<tr>
<td>Key C</td>
</tr>
<tr>
<td>I WANT THIS STRING</td>
</tr>
</table>
Cheers!

The docs explain that if you don't want to use find_all, you can iterate over an element's next_siblings instead:
for sibling in soup.a.next_siblings:
    print(repr(sibling))
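Applied to the question's markup, the next_siblings idea could look roughly like the sketch below. It assumes the heading and its table are direct siblings, as in the sample HTML, and uses a trimmed copy of that HTML; the variable names are only illustrative.

```python
from bs4 import BeautifulSoup, Tag

# Trimmed copy of the question's markup (assumption: headings and tables are siblings).
html = """
<h3>Unimportant heading</h3>
<table class="foo"><tr><td>Key B</td></tr><tr><td>A value I don't want</td></tr></table>
<h3>THE GOOD STUFF</h3>
<table class="foo"><tr><td>Key C</td></tr><tr><td>I WANT THIS STRING</td></tr></table>
<h3>Unimportant heading</h3>
<table class="foo"><tr><td>Key A</td></tr><tr><td>A value I don't want</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")
heading = soup.find("h3", string="THE GOOD STUFF")

table = None
for sibling in heading.next_siblings:
    # Siblings include whitespace text nodes, so only look at real tags.
    if isinstance(sibling, Tag):
        if sibling.name == "table":
            table = sibling
        break  # stop at the first tag, whether or not it is a table

cells = [td.get_text(strip=True) for td in table.find_all("td")]
print(cells[-1])
```

Stopping at the first tag means a heading that is not immediately followed by a table leaves table as None, so real code would want to check for that before reading the cells.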

I am sure there are more efficient ways to do this, but here is what I can think of right now:
from bs4 import BeautifulSoup

# The absolute path makes a separate os.chdir() call unnecessary
with open("/Users/Downloads/train.html", 'r') as f:
    html_data = f.read()
soup = BeautifulSoup(html_data, 'html.parser')
all_td = soup.find_all("td")
flag = 'no_print'
for td in all_td:
    # Print the first cell that comes after the 'Key C' cell, then stop
    if flag == 'print':
        print(td.text)
        break
    if td.text == 'Key C':
        flag = 'print'
Output:
I WANT THIS STRING
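The same "Key C" fallback can probably be written without a flag by letting BeautifulSoup find the key cell and then the next cell after it. This is only a sketch on a trimmed copy of the question's HTML, and it assumes "Key C" appears exactly once.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's markup; "Key C" is assumed to be unique.
html = """
<table class="foo"><tr><td>Key B</td></tr><tr><td>A value I don't want</td></tr></table>
<table class="foo"><tr><td>Key C</td></tr><tr><td>I WANT THIS STRING</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

# Locate the cell whose text is exactly "Key C" ...
key_cell = soup.find("td", string="Key C")
# ... then take the next <td> in document order, guarding against a missing key.
value_cell = key_cell.find_next("td") if key_cell else None
value = value_cell.get_text(strip=True) if value_cell else None
print(value)
```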

Related

Select a specific column and ignore the rest in BeautifulSoup Python (Avoiding nested tables)

I'm trying to get only the first two columns of a webpage table using beautifulsoup in python. The problem is that this table sometimes contains nested tables in the third column. The structure of the html is similar to this:
<table class:"relative-table wrapped">
<tbody>
<tr>
<td>
</td>
<td>
</td>
<td>
</td>
</tr>
<tr>
<td>
</td>
<td>
</td>
<td>
<div class="table-wrap">
<table class="relative-table wrapped">
...
...
</table>
</div>
</td>
</tr>
</tbody>
</table>
The main problem is that I don't know how to simply ignore every third td so I don't read the nested tables inside the main one. I just want to have a list with the first column of the main table and another list with the second column of the main table but the nested table ruins everything when I'm reading.
I have tried with this code:
import requests
from bs4 import BeautifulSoup
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
links = soup.select("table.relative-table tbody tr td.confluenceTd")
anchorList = []
for anchor in links:
    anchorList.append(anchor.text)
del anchorList[2:len(anchorList):3]
for anchorItem in anchorList:
    print(anchorItem)
    print('-------------------')
This works well until it reaches the nested table, and then it starts deleting cells from other columns.
I have also tried this other code:
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
for row in soup.findAll('table')[0].tbody.findAll('tr'):
    firstColumn = row.findAll('td')[0].contents
    secondColumn = row.findAll('td')[1].contents
    print(firstColumn, secondColumn)
But I get an IndexError because it also reads the nested table, and the nested table only has one td.
Does anyone know how I could read the first two columns and ignore the rest?
Thank you.
The question may need better examples or more detail, but as I understand it you are selecting the first <table> and trying to iterate over its rows:
soup.table.select('tr:not(:has(table))')
The selector above excludes all rows that contain an additional <table>.
An alternative would be to get rid of the last (third) <td> in each row:
for row in soup.table.select('tr'):
    row.select_one('td:last-of-type').decompose()
    #### or by its index: row.select_one('td:nth-of-type(3)').decompose()
Now you can perform your selections on a <table> with two columns.
Example
from bs4 import BeautifulSoup
html ='''
<table class:"relative-table wrapped">
<tbody>
<tr>
<td>
</td>
<td>
</td>
<td>
</td>
</tr>
<tr>
<td>
</td>
<td>
</td>
<td>
<div class="table-wrap">
<table class="relative-table wrapped">
...
...
</table>
</div>
</td>
</tr>
</tbody>
</table>
'''
soup = BeautifulSoup(html, 'html.parser')
for row in soup.table.select('tr'):
    row.select_one('td:last-of-type').decompose()
soup
New soup
<table class:"relative-table="" wrapped"="">
<tbody>
<tr>
<td>
</td>
<td>
</td>
</tr>
<tr>
<td>
</td>
<td>
</td>
</tr>
</tbody>
</table>
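If modifying the soup is undesirable, another option is to walk only the direct children of the outer table with recursive=False, so that nested rows and cells are never visited. The sketch below is an assumption-laden illustration: the cell texts (a1, b1, ...) are made up, the class attribute is written with "=" rather than ":", and it relies on the outer table having a <tbody> as in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical data in the shape described in the question.
html = """
<table class="relative-table wrapped">
<tbody>
<tr><td>a1</td><td>b1</td><td>c1</td></tr>
<tr><td>a2</td><td>b2</td>
<td><div class="table-wrap"><table class="relative-table wrapped">
<tbody><tr><td>nested</td></tr></tbody></table></div></td></tr>
</tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
outer = soup.find("table")

first_column, second_column = [], []
# recursive=False restricts the search to direct children, skipping nested tables.
for row in outer.tbody.find_all("tr", recursive=False):
    cells = row.find_all("td", recursive=False)
    if len(cells) >= 2:
        first_column.append(cells[0].get_text(strip=True))
        second_column.append(cells[1].get_text(strip=True))

print(first_column, second_column)
```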

Use beautifulSoup to find a table after a header?

I am trying to scrape some data off a website. The data that I want is listed in a table, but there are multiple tables and no ID's. I then had the idea that I would find the header just above the table I was searching for and then use that as an indicator.
This has really troubled me, so as a last resort I wanted to ask whether someone knows how to use BeautifulSoup to find the table.
A snippet of the HTML code is provided below. Thanks in advance :)
The table I am interested in, is the table right beneath <h2>Mine neaste vagter</h2>
<h2>Min aktuelle vagt</h2>
<div>
<a href='/shifts/detail/595212/'>Flere detaljer</a>
<p>Vagt starter: <b>11/06 2021 - 07:00</b></p>
<p>Vagt slutter: <b>11/06 2021 - 11:00</b></p>
<h2>Masker</h2>
<table class='list'>
<tr><th>Type</th><th>Fra</th><th> </th><th>Til</th></tr>
<tr>
<td>Fri egen regningD</td>
<td>07:00</td>
<td> - </td>
<td>11:00</td>
</tr>
</table>
</div>
<hr>
<h2>Mine neaste vagter</h2>
<table class='list'>
<tr>
<th class="alignleft">Dato</th>
<th class="alignleft">Rolle</th>
<th class="alignleft">Tidsrum</th>
<th></th>
<th class="alignleft">Bytte</th>
<th class="alignleft" colspan='2'></th>
</tr>
<tr class="rowA separator">
<td>
<h3>12/6</h3>
</td>
<td>Kundeservice</td>
<td>18:00 → 21:30 (3.5 t)</td>
<td style="max-width: 20em;"></td>
<td>
<a href="/shifts/ajax/popup/595390/" class="swap shiftpop">
Byt denne vagt
</a>
</td>
<td><a href="/shifts/detail/595390/">Detaljer</td>
<td>
</td>
</tr>
Here are two approaches to find the correct <table>:
Since the table you want is the last one in the HTML, you can use find_all() and index the result with [-1] to get the last table:
print(soup.find_all("table", class_="list")[-1])
Find the h2 element by its text, and then use the find_next() method to find the table:
print(soup.find(lambda tag: tag.name == "h2" and "Mine neaste vagter" in tag.text).find_next("table"))
You can use :-soup-contains (or just :contains) to target the <h2> by its text and then use find_next to move to the table:
from bs4 import BeautifulSoup as bs
html = '''your html'''
soup = bs(html, 'lxml')
soup.select_one('h2:-soup-contains("Mine neaste vagter")').find_next('table')
This is assuming the HTML, as shown, is returned by whatever access method you are using.
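Either approach returns None-related errors if the heading is missing, so it may be worth wrapping the lookup in a small helper. The function name table_after_heading and the trimmed HTML below are illustrative assumptions, not part of the original answers.

```python
from bs4 import BeautifulSoup

# Trimmed version of the question's markup.
html = """
<h2>Masker</h2>
<table class='list'><tr><th>Type</th><th>Fra</th></tr><tr><td>Fri</td><td>07:00</td></tr></table>
<h2>Mine neaste vagter</h2>
<table class='list'>
<tr><th>Dato</th><th>Rolle</th><th>Tidsrum</th></tr>
<tr><td><h3>12/6</h3></td><td>Kundeservice</td><td>18:00 → 21:30 (3.5 t)</td></tr>
</table>
"""


def table_after_heading(soup, heading_text, heading_tag="h2"):
    """Return the first <table> after the heading containing heading_text, or None."""
    heading = soup.find(
        lambda tag: tag.name == heading_tag and heading_text in tag.get_text()
    )
    return heading.find_next("table") if heading else None


soup = BeautifulSoup(html, "html.parser")
table = table_after_heading(soup, "Mine neaste vagter")

# Turn the table into a list of rows, each row a list of cell texts.
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    for tr in table.find_all("tr")
]
print(rows)
```

Returning None for a missing heading lets the caller decide how to handle pages where the table is absent instead of failing with an AttributeError.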

How do I read text in HTML table cells using python and selenium?

I've a table on a website much like this:
<table class="table-class">
<thead>
<tr>
<th>Col 1</th>
<th>Col 2</th>
<th>Col 3</th>
</tr>
</thead>
<tbody>
<tr>
<td>Hello</td>
<td>A number</td>
<td>Another number<td>
</tr>
<tr>
<td>there</td>
<td>A number</td>
<td>Another number<td>
</tr>
</tbody>
</table>
Ultimately, what I would like to do is to read the content of each td for each row and produce a string containing all three cells for each respective row. Furthermore, I would like for this to scale to handle larger tables from numerous websites using the same design, so speed is somewhat of a priority, but not a necessity.
I assume I have to use something like find_elements_by_xpath(...) or similar, but I'm really hitting a wall with this. I've attempted several approaches suggested on other sites and seem to do more things wrong than right. Any sort of suggestion or idea would be hugely appreciated!
What I currently have, although non-functioning and based on another question from here, is:
listoflist = [[td.text
    for td in tr.find_elements_by_xpath('td')]
    for tr in driver.find_elements_by_xpath("//table[@class='table-class')]//tr"]
listofdict = [dict(zip(list_of_lists[0],row)) for row in list_of_lists[1:]]
Thanks in advance!
vham
Depending on the website you are trying to access, you might not need to go as far as needing selenium. You could just access the html using requests.
For the HTML you have given, you could use BeautifulSoup to extract the table information as follows:
from bs4 import BeautifulSoup
html = """
<table class="table-class">
<thead>
<tr>
<th>Col 1</th>
<th>Col 2</th>
<th>Col 3</th>
</tr>
</thead>
<tbody>
<tr>
<td>Hello</td>
<td>A number</td>
<td>Another number<td>
</tr>
<tr>
<td>there</td>
<td>A number</td>
<td>Another number<td>
</tr>
</tbody>
</table>"""
soup = BeautifulSoup(html, "html.parser")
rows = []
for tr in soup.find_all('tr'):
    cols = []
    for td in tr.find_all(['td', 'th']):
        td_text = td.get_text(strip=True)
        # Skip the empty cells created by the unclosed <td> tags
        if len(td_text):
            cols.append(td_text)
    rows.append(cols)
print(rows)
Giving you rows holding:
[[u'Col 1', u'Col 2', u'Col 3'], [u'Hello', u'A number', u'Another number'], [u'there', u'A number', u'Another number']]
To use requests, it would start something like:
import requests
response = requests.get(url)
html = response.text
If you are familiar with the DOM (Document Object Model), you can use the answers in this post and the BeautifulSoup library to load the HTML as a DOM tree. After that you can simply find each <tr> and, for each of those, find all the <td> tags inside it. Think of the DOM as a tree structure where branching happens at nested tags.
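The tr-then-td traversal described above can be sketched as follows to produce one string per row, which is what the question asks for. The " | " separator is an arbitrary choice, and the sample HTML here uses properly closed </td> tags, which is an assumption about the real page.

```python
from bs4 import BeautifulSoup

# Sample table from the question, with the cells closed properly.
html = """
<table class="table-class">
<thead><tr><th>Col 1</th><th>Col 2</th><th>Col 3</th></tr></thead>
<tbody>
<tr><td>Hello</td><td>A number</td><td>Another number</td></tr>
<tr><td>there</td><td>A number</td><td>Another number</td></tr>
</tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="table-class")

row_strings = []
# Visit each body row, then each cell inside that row.
for tr in table.tbody.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    row_strings.append(" | ".join(cells))

print(row_strings)
```

With Selenium, the same code should work on driver.page_source in place of the html string, so the browser is only needed to fetch the page.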

Can't scrape HTML table using BeautifulSoup

I'm trying to scrape data off a table on a web page using Python, BeautifulSoup, Requests, as well as Selenium to log into the site.
Here's the table I'm looking to get data for...
<div class="sastrupp-class">
<table>
<tbody>
<tr>
<td class="key">Thing I dont want 1</td>
<td class="value money">$1.23</td>
<td class="key">Thing I dont want 2</td>
<td class="value">99,999,999</td>
<td class="key">Target</td>
<td class="money value">$1.23</td>
<td class="key">Thing I dont want 3</td>
<td class="money value">$1.23</td>
<td class="key">Thing I dont want 4</td>
<td class="value percentage">1.23%</td>
<td class="key">Thing I dont want 5</td>
<td class="money value">$1.23</td>
</tr>
</tbody>
</table>
</div>
I can find the "sastrupp-class" fine, but I don't know how to look through it and get to the part of the table I want.
I figured I could just look for the class that I'm searching for like this...
output = soup.find('td', {'class':'key'})
print(output)
but that doesn't return anything.
Important to note:
1. Other <td>s inside the table have the same class name as the one that I want. If I can't separate them out, I'm OK with that, although I'd rather just return the one I want.
2. There are other <div>s with class="sastrupp-class" on the site.
I'm obviously a beginner at this so let me know if I can help you help me.
Any help/pointers would be appreciated.
1) First off, to get your 'Target' you need find_all, not find. Then, assuming you know exactly which position your target will be in (in the example you gave it is index 2), the solution could look like this:
from bs4 import BeautifulSoup
html = """(YOUR HTML)"""
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('div', {'class': 'sastrupp-class'})
all_keys = table.find_all('td', {'class': 'key'})
my_key = all_keys[2]
print(my_key.text)  # prints 'Target'
2) Regarding "There are other <div>s with class="sastrupp-class" on the site":
Again, select the one you need by using find_all and then picking the correct index.
Example HTML:
<body>
<div class="sastrupp-class"> Don't need this</div>
<div class="sastrupp-class"> Don't need this</div>
<div class="sastrupp-class"> Don't need this</div>
<div class="sastrupp-class"> Target</div>
</body>
To extract the target, you can just:
all_divs = soup.find_all('div', {'class':'sastrupp-class'})
target = all_divs[3] # assuming you know exactly which index to look for
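If the position of the target is not stable, matching on the cell text may be more robust than a fixed index. The sketch below assumes the key text "Target" is known and unique inside the div; the dollar amounts are made-up values chosen so the result is distinguishable.

```python
from bs4 import BeautifulSoup

# Shortened version of the question's table with hypothetical values.
html = """
<div class="sastrupp-class"><table><tbody><tr>
<td class="key">Thing I dont want 1</td><td class="value money">$1.23</td>
<td class="key">Target</td><td class="money value">$4.56</td>
<td class="key">Thing I dont want 3</td><td class="money value">$7.89</td>
</tr></tbody></table></div>
"""

soup = BeautifulSoup(html, "html.parser")

# Find the key cell by class and exact text instead of by position.
key_cell = soup.find("td", class_="key", string="Target")
# The matching value is the next sibling cell that carries the "value" class.
value_cell = key_cell.find_next_sibling("td", class_="value") if key_cell else None
target_value = value_cell.get_text(strip=True) if value_cell else None
print(target_value)
```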

Beautifulsoup Unable to Find Classes with Hyphens in Their Name

I am using BeautifulSoup4 on a MacOSX running Python 2.7.8. I am having difficulty extracting information from the following html code
<tbody tabindex="0" class="yui-dt-data" id="yui_3_5_0_1_1408418470185_1650">
<tr id="yui-rec0" class="yui-dt-first yui-dt-even">
<td headers="yui-dt0-th-rank" class="rank yui-dt0-col-rank"></td>
</tr>
<tr id="yui-rec1" class="yui-dt-odd">...</tr>
<tr id="yui-rec2" class="yui-dt-even">...</tr>
</tbody>
I can't seem to grab the table or any of its contents because BS and/or Python doesn't seem to recognize values with hyphens. So the usual code, something like
Table = soup.find('tbody',{'class':'yui-dt-data'})
or
Row2 = Table.find('tr',{'id':'yui-rec2'})
just returns an empty object (not NONE, simply empty). I'm not new to BS4 or Python and I've extracted information from this site before, but the class names are different now than when I previously did it. Now everything has hyphens. Is there any way to get Python to recognize the hyphen or a workaround?
I need to have my code be general so that I can run it across numerous pages that all have the same class name. Unfortunately, the id attribute in <tbody> is unique to that particular table, so I can't use that to identify this table across webpages.
Any help would be appreciated. Thanks in advance.
The following code:
from bs4 import BeautifulSoup
htmlstring = """ <tbody tabindex="0" class="yui-dt-data" id="yui_3_5_0_1_1408418470185_1650">
<tr id="yui-rec0" class="yui-dt-first yui-dt-even">
<tr id="yui-rec1" class="yui-dt-odd">
<tr id="yui-rec2" class="yui-dt-even">"""
soup = BeautifulSoup(htmlstring)
Table = soup.find('tbody', attrs={'class': 'yui-dt-data'})
print("Table:\n")
print(Table)
tr = Table.find('tr', attrs={'class': 'yui-dt-odd'})
print("tr:\n")
print(tr)
outputs:
Table:
<tbody class="yui-dt-data" id="yui_3_5_0_1_1408418470185_1650" tabindex="0">
<tr class="yui-dt-first yui-dt-even" id="yui-rec0">
<tr class="yui-dt-odd" id="yui-rec1">
<tr class="yui-dt-even" id="yui-rec2"></tr></tr></tr></tbody>
tr:
<tr class="yui-dt-odd" id="yui-rec1">
<tr class="yui-dt-even" id="yui-rec2"></tr></tr>
Even though the html you supplied isn't by itself valid, it seems that BS is making a guess about how it should be, because soup.prettify() yields
<tbody class="yui-dt-data" id="yui_3_5_0_1_1408418470185_1650" tabindex="0">
<tr class="yui-dt-first yui-dt-even" id="yui-rec0">
<tr class="yui-dt-odd" id="yui-rec1">
<tr class="yui-dt-even" id="yui-rec2">
</tr>
</tr>
</tr>
</tbody>
Though I'm guessing those tr's aren't supposed to be nested.
Could you try running that exact code and seeing what the output is?
For people looking for a way to find a tag with a hyphen in its attribute names, there is an answer in the documentation:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#the-keyword-arguments
This snippet raises an error:
data_soup = BeautifulSoup('<div data-foo="value">foo!</div>')
data_soup.find_all(data-foo="value")
# SyntaxError: keyword can't be an expression
Instead, you should do this:
data_soup.find_all(attrs={"data-foo": "value"})
# [<div data-foo="value">foo!</div>]
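To put the pieces together: hyphens in attribute values (class names, ids) should not need special treatment, while hyphens in attribute names need attrs or a CSS selector. The sketch below uses a cut-down, well-formed version of the question's markup with an invented data-foo attribute added for illustration.

```python
from bs4 import BeautifulSoup

# Cut-down markup; the data-foo attribute is hypothetical.
html = """
<table><tbody class="yui-dt-data">
<tr id="yui-rec0" class="yui-dt-first yui-dt-even"><td data-foo="value">first</td></tr>
<tr id="yui-rec1" class="yui-dt-odd"><td>second</td></tr>
</tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")

# Hyphenated values work with the normal keyword arguments.
body = soup.find("tbody", class_="yui-dt-data")
odd_row = body.find("tr", id="yui-rec1")

# Hyphenated attribute names cannot be Python keywords, so pass them via attrs ...
by_attrs = soup.find_all(attrs={"data-foo": "value"})
# ... or use a CSS attribute selector.
by_css = soup.select('tr.yui-dt-even td[data-foo="value"]')

print(odd_row.get_text(strip=True), by_attrs[0].get_text(), by_css[0].get_text())
```

If lookups like these still come back empty on a live page, one possible cause is that the table is built by JavaScript and is not present in the HTML that was downloaded.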
Just use select (tested with bs4 4.7.1):
import requests
from bs4 import BeautifulSoup as bs
html = '''
<tbody tabindex="0" class="yui-dt-data" id="yui_3_5_0_1_1408418470185_1650">
<tr id="yui-rec0" class="yui-dt-first yui-dt-even">
<td headers="yui-dt0-th-rank" class="rank yui-dt0-col-rank"></td>
</tr>
<tr id="yui-rec1" class="yui-dt-odd">...</tr>
<tr id="yui-rec2" class="yui-dt-even">...</tr>
</tbody>
'''
soup = bs(html, 'lxml')
soup.select('.yui-dt-data')
