Use beautifulSoup to find a table after a header?

I am trying to scrape some data off a website. The data I want is in a table, but the page has multiple tables and none of them has an ID. My idea was to find the header just above the table I am looking for and use it as an anchor.
This has really troubled me, so as a last resort I wanted to ask whether someone knows how to use BeautifulSoup to find the table.
A snippet of the HTML is provided below. Thanks in advance :)
The table I am interested in is the one right beneath <h2>Mine neaste vagter</h2>:
<h2>Min aktuelle vagt</h2>
<div>
<a href='/shifts/detail/595212/'>Flere detaljer</a>
<p>Vagt starter: <b>11/06 2021 - 07:00</b></p>
<p>Vagt slutter: <b>11/06 2021 - 11:00</b></p>
<h2>Masker</h2>
<table class='list'>
<tr><th>Type</th><th>Fra</th><th> </th><th>Til</th></tr>
<tr>
<td>Fri egen regningD</td>
<td>07:00</td>
<td> - </td>
<td>11:00</td>
</tr>
</table>
</div>
<hr>
<h2>Mine neaste vagter</h2>
<table class='list'>
<tr>
<th class="alignleft">Dato</th>
<th class="alignleft">Rolle</th>
<th class="alignleft">Tidsrum</th>
<th></th>
<th class="alignleft">Bytte</th>
<th class="alignleft" colspan='2'></th>
</tr>
<tr class="rowA separator">
<td>
<h3>12/6</h3>
</td>
<td>Kundeservice</td>
<td>18:00 → 21:30 (3.5 t)</td>
<td style="max-width: 20em;"></td>
<td>
<a href="/shifts/ajax/popup/595390/" class="swap shiftpop">
Byt denne vagt
</a>
</td>
<td><a href="/shifts/detail/595390/">Detaljer</td>
<td>
</td>
</tr>

Here are two approaches to find the correct <table>:
Since the table you want is the last one in the HTML, you can use find_all() and index the result with [-1] to get the last table:
print(soup.find_all("table", class_="list")[-1])
Find the h2 element by its text, and then use the find_next() method to find the table:
print(soup.find(lambda tag: tag.name == "h2" and "Mine neaste vagter" in tag.text).find_next("table"))

You can use :-soup-contains (or the older :contains, which newer versions of Soup Sieve deprecate) to target the <h2> by its text, and then use find_next to move to the table:
from bs4 import BeautifulSoup as bs
html = '''your html'''
soup = bs(html, 'lxml')
soup.select_one('h2:-soup-contains("Mine neaste vagter")').find_next('table')
This is assuming the HTML, as shown, is returned by whatever access method you are using.
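Once the table has been located, its rows still need to be read. The following is a minimal sketch of that step, assuming a shortened copy of the HTML from the question and the built-in html.parser; the variable names heading, table and rows are chosen only for illustration.

```python
from bs4 import BeautifulSoup

# Shortened copy of the HTML from the question (assumption: the real page
# has the same heading/table layout).
html = """
<h2>Masker</h2>
<table class='list'>
<tr><th>Type</th><th>Fra</th></tr>
<tr><td>Fri egen regningD</td><td>07:00</td></tr>
</table>
<hr>
<h2>Mine neaste vagter</h2>
<table class='list'>
<tr><th class="alignleft">Dato</th><th class="alignleft">Rolle</th><th class="alignleft">Tidsrum</th></tr>
<tr class="rowA separator"><td><h3>12/6</h3></td><td>Kundeservice</td><td>18:00 → 21:30 (3.5 t)</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Locate the heading by its exact text, then take the first table after it.
heading = soup.find("h2", string="Mine neaste vagter")
table = heading.find_next("table")

# Skip the header row and collect the cell texts of every data row.
rows = []
for tr in table.find_all("tr")[1:]:
    rows.append([td.get_text(strip=True) for td in tr.find_all("td")])

print(rows)
```

If the heading text on the real page carries extra whitespace, the exact string match will not find it, and a lambda or regular expression match as in the answers above would be the safer choice.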

Related

Select a specific column and ignore the rest in BeautifulSoup Python (Avoiding nested tables)

I'm trying to get only the first two columns of a webpage table using beautifulsoup in python. The problem is that this table sometimes contains nested tables in the third column. The structure of the html is similar to this:
<table class:"relative-table wrapped">
<tbody>
<tr>
<td>
</td>
<td>
</td>
<td>
</td>
</tr>
<tr>
<td>
</td>
<td>
</td>
<td>
<div class="table-wrap">
<table class="relative-table wrapped">
...
...
</table>
</div>
</td>
</tr>
</tbody>
</table>
The main problem is that I don't know how to simply ignore every third td so I don't read the nested tables inside the main one. I just want to have a list with the first column of the main table and another list with the second column of the main table but the nested table ruins everything when I'm reading.
I have tried with this code:
import requests
from bs4 import BeautifulSoup
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
links = soup.select("table.relative-table tbody tr td.confluenceTd")
anchorList = []
for anchor in links:
    anchorList.append(anchor.text)
del anchorList[2:len(anchorList):3]
for anchorItem in anchorList:
    print(anchorItem)
    print('-------------------')
This works well until I reach the nested table, and then it starts deleting other columns.
I have also tried this other code:
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
for row in soup.findAll('table')[0].tbody.findAll('tr'):
    firstColumn = row.findAll('td')[0].contents
    secondColumn = row.findAll('td')[1].contents
    print(firstColumn, secondColumn)
But I get an IndexError because it is reading the nested table, and the nested table only has one td.
Does anyone know how I could read the first two columns and ignore the rest?
Thank you.

It may need some improved examples or details to clarify, but as I understand it you are selecting the first <table> and trying to iterate over its rows:
soup.table.select('tr:not(:has(table))')
The selector above excludes all the rows that contain an additional <table>.
An alternative would be to get rid of the last (third) <td> in each row:
for row in soup.table.select('tr'):
    row.select_one('td:last-of-type').decompose()
    # or by its index: row.select_one('td:nth-of-type(3)').decompose()
Now you can perform your selections on a <table> with two columns.
Example
from bs4 import BeautifulSoup
html ='''
<table class:"relative-table wrapped">
<tbody>
<tr>
<td>
</td>
<td>
</td>
<td>
</td>
</tr>
<tr>
<td>
</td>
<td>
</td>
<td>
<div class="table-wrap">
<table class="relative-table wrapped">
...
...
</table>
</div>
</td>
</tr>
</tbody>
</table>
'''
soup = BeautifulSoup(html, 'html.parser')
for row in soup.table.select('tr'):
    row.select_one('td:last-of-type').decompose()
soup
New soup
<table class:"relative-table="" wrapped"="">
<tbody>
<tr>
<td>
</td>
<td>
</td>
</tr>
<tr>
<td>
</td>
<td>
</td>
</tr>
</tbody>
</table>
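As an alternative that leaves the parsed tree untouched, the two columns can also be collected from the direct child cells of each outer row. This is a minimal sketch, assuming the outer table has an explicit <tbody> as in the question; the cell texts (a1, b1, nested and so on) are placeholders invented for the illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical table with the same shape as in the question; the cell
# texts are made up so the result is easy to follow.
html = """
<table class="relative-table wrapped">
<tbody>
<tr><td>a1</td><td>b1</td><td>c1</td></tr>
<tr><td>a2</td><td>b2</td><td>
  <div class="table-wrap">
    <table class="relative-table wrapped">
      <tbody><tr><td>nested</td></tr></tbody>
    </table>
  </div>
</td></tr>
</tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
outer = soup.find("table")

first_column, second_column = [], []
# recursive=False restricts the search to direct children, so rows and
# cells that belong to the nested table are never visited.
for row in outer.tbody.find_all("tr", recursive=False):
    cells = row.find_all("td", recursive=False)
    if len(cells) >= 2:
        first_column.append(cells[0].get_text(strip=True))
        second_column.append(cells[1].get_text(strip=True))

print(first_column)
print(second_column)
```

If the real page omits <tbody>, iterating over outer.find_all("tr", recursive=False) instead should serve the same purpose with html.parser, since that parser does not insert a missing <tbody>.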

How to extract HTML table following a specific heading?

I am using BeautifulSoup to parse HTML files. I have an HTML file similar to this:
<h3>Unimportant heading</h3>
<table class="foo">
<tr>
<td>Key A</td>
</tr>
<tr>
<td>A value I don't want</td>
</tr>
</table>
<h3>Unimportant heading</h3>
<table class="foo">
<tr>
<td>Key B</td>
</tr>
<tr>
<td>A value I don't want</td>
</tr>
</table>
<h3>THE GOOD STUFF</h3>
<table class="foo">
<tr>
<td>Key C</td>
</tr>
<tr>
<td>I WANT THIS STRING</td>
</tr>
</table>
<h3>Unimportant heading</h3>
<table class="foo">
<tr>
<td>Key A</td>
</tr>
<tr>
<td>A value I don't want</td>
</tr>
</table>
I want to extract the string "I WANT THIS STRING". The perfect solution would be to get the first table following the h3 heading called "THE GOOD STUFF". I have no idea how to do this with BeautifulSoup - I only know how to extract a table with a specific class, or a table nested within some particular tag, but not following a particular tag.
I think a fallback solution could make use of the string "Key C", assuming it's unique (it almost certainly is) and appears in only that one table, but I'd feel better with going for the specific h3 heading.

Following the logic of @Zroq's answer on another question, this code will give you the table following your defined header ("THE GOOD STUFF"). Please note that I just put all your HTML in a variable called "html".
import re
from bs4 import BeautifulSoup, Tag
soup = BeautifulSoup(html, "lxml")
# string= is the current name of the older text= argument
for header in soup.find_all('h3', string=re.compile('THE GOOD STUFF')):
    nextNode = header
    while True:
        # Walk the siblings that follow the header
        nextNode = nextNode.next_sibling
        if nextNode is None:
            break
        if isinstance(nextNode, Tag):
            # Stop at the next heading; print every tag before it
            if nextNode.name == "h3":
                break
            print(nextNode)
Output:
<table class="foo">
<tr>
<td>Key C</td>
</tr>
<tr>
<td>I WANT THIS STRING</td>
</tr>
</table>
Cheers!

The docs explain that if you don't want to use find_all, you can do this:
for sibling in soup.a.next_siblings:
    print(repr(sibling))
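The same sibling navigation can be written more compactly with find_next_sibling. The sketch below assumes a condensed copy of the HTML from the question and that the wanted value is the last cell of the table; heading, table, cells and wanted are illustrative names.

```python
from bs4 import BeautifulSoup

# Condensed copy of the HTML from the question.
html = """
<h3>Unimportant heading</h3>
<table class="foo"><tr><td>Key B</td></tr><tr><td>A value I don't want</td></tr></table>
<h3>THE GOOD STUFF</h3>
<table class="foo"><tr><td>Key C</td></tr><tr><td>I WANT THIS STRING</td></tr></table>
<h3>Unimportant heading</h3>
<table class="foo"><tr><td>Key A</td></tr><tr><td>A value I don't want</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

# Find the heading by its exact text, then the first <table> among the
# siblings that follow it.
heading = soup.find("h3", string="THE GOOD STUFF")
table = heading.find_next_sibling("table")

# Assumption: the wanted value is the last cell of that table.
cells = [td.get_text(strip=True) for td in table.find_all("td")]
wanted = cells[-1]
print(wanted)
```

find_next_sibling only looks at elements on the same level as the heading. If the table may be wrapped in another element, find_next("table") searches everything that follows in document order instead.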

I am sure there are more efficient ways to do this, but here is what I can think of right now:
from bs4 import BeautifulSoup
# Read the saved HTML file
with open("/Users/Downloads/train.html", 'r') as f:
    html_data = f.read()
soup = BeautifulSoup(html_data, 'html.parser')
all_td = soup.find_all("td")
flag = 'no_print'
for td in all_td:
    # Print the first cell that comes after the 'Key C' cell, then stop
    if flag == 'print':
        print(td.text)
        break
    if td.text == 'Key C':
        flag = 'print'
Output:
I WANT THIS STRING

How to extract the innerText of a <td> element with respect to the innerText of another <td> element

I am using Selenium in Python and have come across this table web element. I need to check whether a string is present in the web element and, if it is, return a corresponding string.
<table width="700px" class="tableListGrid">
<thead>
<tr class="tableInfoTrBox">
<th>Date</th>
<th>Task Code</th>
<!-- th>Phone Number</th -->
<th>Fota Job</th>
<th colspan="2" class="thLineEnd">Task Description</th>
</tr>
</thead>
<tbody>
<tr class="tableTr_r">
<td>2018-04-06 05:48:29</td>
<td>FU</td>
<!-- td></td -->
<td>
57220180406-JSA69596727
</td>
<td style="text-align:left;">
updated from [A730FXXU1ARAB/A730FOJM1ARAB/A730FXXU1ARAB] to [A730FXXU2ARC9/A730FOJM2ARC1/A730FXXU2ARC9]
</td>
<td>
<table class="btnTypeE">
<tr>
<td>
View
</td>
</tr>
</table>
</td>
</tr>
</tbody>
</table>
I need to search for "A730FXXU2ARC9/A730FOJM2ARC1/A730FXXU2ARC9" in this element and return "57220180406-JSA69596727", which is in the same row but in a different cell. Is this possible in Selenium?
EDIT: Cleaned the code to only contain useful data.

It can be achieved by finding the element using the following XPath:
//td[contains(., 'A730FXXU2ARC9/A730FOJM2ARC1/A730FXXU2ARC9')]/preceding-sibling::td[1]/a
The XPath can be read as: find the td which contains "A730FXXU2ARC9/A730FOJM2ARC1/A730FXXU2ARC9", then find the td immediately preceding it and move to its a tag.
After this you can get the text using Selenium:
driver.find_element(By.XPATH, "//td[contains(., 'A730FXXU2ARC9/A730FOJM2ARC1/A730FXXU2ARC9')]/preceding-sibling::td[1]/a").text

To look for a text, e.g. A730FXXU2ARC9/A730FOJM2ARC1/A730FXXU2ARC9, and find an associated text, e.g. 57220180406-JSA69596727, you can write a function as follows:
def test_me(myString):
    myText = driver.find_element_by_xpath("//table[@class='tableListGrid']//tbody/tr[@class='tableTr_r']//td[.='" + myString + "']//preceding::td[1]/a").get_attribute("innerHTML")
    return myText
Now, from your main()/@Test you can call the function with the desired text as follows:
test_me("A730FXXU2ARC9/A730FOJM2ARC1/A730FXXU2ARC9")

Extracting string from an HTML table from a given tab using Python

I need to extract string values from the HTML table below. I want to loop through the table under a particular tab and copy the results horizontally to the command line or to a file.
I am pasting only one row of information here.
This table gets updated based on the changes happening on the Gerrits.
The result that I want is all the Gerrit numbers under a given tab.
For example, if I want to get the Gerrit list from the approval queue, the values should display as shown below.
7897423, 2423343, 34242342, 34234, 57575675
<ul>
<li><span>Review Queue</span></li>
<li><span>Approval Queue</span></li>
<li><span>Verification Queue</span></li>
<li><span>Merge Queue</span></li>
<li><span>Open Queue</span></li>
<li><span>Failed verification</span></li>
</ul>
<div id="tab1">
<h1>Review Queue</h1>
<table class="tablesorter" id="dashboardTable">
<thead>
<tr>
<th></th>
<th>Gerrit</th>
<th>Owner</th>
<th>CR(s)</th>
<th>Project</th>
<th>Dev Branch/PL</th>
<th>Subject</th>
<th>Status</th>
<th>Days in Queue</th>
</tr>
</thead>
<tbody>
<tr>
<td><input type="checkbox" /></td>
<td> 1696771 </td>
<td> ponga </td>
<td> 1055680 </td>
<td>platform/hardware/kiosk/</td>
<td> hidden-userspace.aix.2.0.dev </td>
<td>display: information regarding display</td>
<td> some info here </td>
<td> 2 </td>
</tr>

What stops you from leveraging BeautifulSoup for this?
Let's say you have already read the HTML (using sgmllib or any other library) into a string variable named html_contents.
Since you do not mention which column you want to get data from, I am extracting the Gerrit number column.
You can simply do:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_contents, 'html.parser')
for tr in soup.tbody:
    # Skip anything that is not a <tr> tag
    if tr.name == 'tr':
        print(tr.contents[3].contents[1].string)
Above, you loop over all the tr tags inside the tbody and (assuming all the td tags inside each tr have the same structure) extract their value, in this case the value of the a tag inside.
Read the quick start; it will make your life easier when parsing HTML documents.
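The positional indexing above depends on the exact whitespace and nesting of each row. A less position-sensitive sketch looks the column up by its header text and joins the values on one line. It assumes the HTML as posted in the question (cells holding plain text), trimmed to three columns; the second data row is hypothetical and was added only to show the joined output.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the HTML from the question. The second data row is a
# made-up example row, not taken from the original page.
html_contents = """
<div id="tab1">
<h1>Review Queue</h1>
<table class="tablesorter" id="dashboardTable">
<thead><tr><th></th><th>Gerrit</th><th>Owner</th></tr></thead>
<tbody>
<tr><td><input type="checkbox" /></td><td> 1696771 </td><td> ponga </td></tr>
<tr><td><input type="checkbox" /></td><td> 1696772 </td><td> someone </td></tr>
</tbody>
</table>
</div>
"""

soup = BeautifulSoup(html_contents, "html.parser")

# Pick the table that follows the heading of the wanted tab.
heading = soup.find("h1", string="Review Queue")
table = heading.find_next("table")

# Find the column position from the header row instead of hard-coding it.
headers = [th.get_text(strip=True) for th in table.thead.find_all("th")]
col = headers.index("Gerrit")

gerrits = [
    tr.find_all("td")[col].get_text(strip=True)
    for tr in table.tbody.find_all("tr")
]
print(", ".join(gerrits))
```

Changing the heading text to another queue name would select that tab's table instead, provided each tab follows the same heading-then-table layout.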

Parsing an HTML file with selectorgadget.com

How can I use Beautiful Soup and SelectorGadget to scrape a website? For example, I have a website (a Newegg product page) and I would like my script to return all of the specifications of that product (click on SPECIFICATIONS). By this I mean: Intel, Desktop, ......, 2.4GHz, 1066Mhz, ......, 3 years limited.
After using selectorgadget I get the string-
.desc
How do I use this?
Thanks :)

Inspecting the page, I can see that the specifications are placed in a div with the ID pcraSpecs:
<div id="pcraSpecs">
<script type="text/javascript">...</script>
<TABLE cellpadding="0" cellspacing="0" class="specification">
<TR>
<TD colspan="2" class="title">Model</TD>
</TR>
<TR>
<TD class="name">Brand</TD>
<TD class="desc"><script type="text/javascript">document.write(neg_specification_newline('Intel'));</script></TD>
</TR>
<TR>
<TD class="name">Processors Type</TD>
<TD class="desc"><script type="text/javascript">document.write(neg_specification_newline('Desktop'));</script></TD>
</TR>
...
</TABLE>
</div>
desc is the class of the table cells.
What you want to do is to extract the contents of this table.
soup.find(id="pcraSpecs").findAll("td") should get you started.

Have you tried using Feedity - http://feedity.com for creating a custom RSS feed from any webpage.
