I need to extract string values from the HTML table below. I want to loop through the table under a particular tab and copy the results horizontally to the command line or to a file.
I am pasting only one row of information here.
This table gets updated based on the changes happening on the Gerrits.
The result I want is all the Gerrit numbers under a given tab.
For example, if I want to get the Gerrit list from the approval queue, the values should display like this:
7897423, 2423343, 34242342, 34234, 57575675
<ul>
<li><span>Review Queue</span></li>
<li><span>Approval Queue</span></li>
<li><span>Verification Queue</span></li>
<li><span>Merge Queue</span></li>
<li><span>Open Queue</span></li>
<li><span>Failed verification</span></li>
</ul>
<div id="tab1">
<h1>Review Queue</h1>
<table class="tablesorter" id="dashboardTable">
<thead>
<tr>
<th></th>
<th>Gerrit</th>
<th>Owner</th>
<th>CR(s)</th>
<th>Project</th>
<th>Dev Branch/PL</th>
<th>Subject</th>
<th>Status</th>
<th>Days in Queue</th>
</tr>
</thead>
<tbody>
<tr>
<td><input type="checkbox" /></td>
<td> 1696771 </td>
<td> ponga </td>
<td> 1055680 </td>
<td>platform/hardware/kiosk/</td>
<td> hidden-userspace.aix.2.0.dev </td>
<td>display: information regarding display</td>
<td> some info here </td>
<td> 2 </td>
</tr>
</tbody>
</table>
</div>
What stops you from leveraging BeautifulSoup for this?
Let's say you have already read the HTML (using urllib or any other library) into a string variable named html_contents.
Since you don't mention which column you want the data from, I am extracting the Gerrit number column.
You can simply do:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_contents, 'html.parser')
for tr in soup.tbody.find_all('tr'):
    # cell 0 is the checkbox, cell 1 is the Gerrit number
    print(tr.find_all('td')[1].get_text(strip=True))
This loops over all the tr tags inside the tbody and (assuming every row has the same structure) extracts the value of the Gerrit column from each one.
Read the quick start guide; it will make your life easier when parsing HTML documents.
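If you also want the single comma-separated line shown in the question, a small sketch building on the loop above (same html_contents assumption):
# collect the Gerrit numbers, then join them with ', '
gerrits = [tr.find_all('td')[1].get_text(strip=True)
           for tr in soup.tbody.find_all('tr')]
print(', '.join(gerrits))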
I'm trying to get only the first two columns of a webpage table using beautifulsoup in python. The problem is that this table sometimes contains nested tables in the third column. The structure of the html is similar to this:
<table class:"relative-table wrapped">
<tbody>
<tr>
<td>
<\td>
<td>
<\td>
<td>
<\td>
<\tr>
<tr>
<td>
<\td>
<td>
<\td>
<td>
<div class="table-wrap">
<table class="relative-table wrapped">
...
...
<\table>
<\div>
<\td>
<\tr>
<\tbody>
<\table>
The main problem is that I don't know how to simply ignore every third td so that I don't read the nested tables inside the main one. I just want a list with the first column of the main table and another list with the second column of the main table, but the nested table ruins everything when I'm reading.
I have tried with this code:
import requests
from bs4 import BeautifulSoup

page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
links = soup.select("table.relative-table tbody tr td.confluenceTd")
anchorList = []
for anchor in links:
    anchorList.append(anchor.text)
del anchorList[2:len(anchorList):3]
for anchorItem in anchorList:
    print(anchorItem)
    print('-------------------')
This works really well until I reach the nested table, and then it starts deleting other columns.
I have also tried this other code:
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
for row in soup.findAll('table')[0].tbody.findAll('tr'):
    firstColumn = row.findAll('td')[0].contents
    secondColumn = row.findAll('td')[1].contents
    print(firstColumn, secondColumn)
But I get an IndexError because it's reading the nested table, and the nested table only has one td.
Does anyone know how I could read the first two columns and ignore the rest?
Thank you.
It may need some improved examples / details to clarify, but as I understand it, you are selecting the first <table> and trying to iterate its rows:
soup.table.select('tr:not(:has(table))')
The selector above would exclude all the rows that include an additional <table>.
An alternative would be to get rid of the last/third <td>:
for row in soup.table.select('tr'):
    row.select_one('td:last-of-type').decompose()
    # or by its index: row.select_one('td:nth-of-type(3)').decompose()
Now you could perform your selections on a <table> with two columns.
Example
from bs4 import BeautifulSoup
html ='''
<table class:"relative-table wrapped">
<tbody>
<tr>
<td>
</td>
<td>
</td>
<td>
</td>
</tr>
<tr>
<td>
</td>
<td>
</td>
<td>
<div class="table-wrap">
<table class="relative-table wrapped">
...
...
</table>
</div>
</td>
</tr>
</tbody>
</table>
'''
soup = BeautifulSoup(html, 'html.parser')
for row in soup.table.select('tr'):
    row.select_one('td:last-of-type').decompose()
soup
New soup
<table class:"relative-table="" wrapped"="">
<tbody>
<tr>
<td>
</td>
<td>
</td>
</tr>
<tr>
<td>
</td>
<td>
</td>
</tr>
</tbody>
</table>
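If you only need the two column lists, here is a minimal sketch (my own variant, not from the answer above) that walks only the outer table's direct rows, so the nested table is never visited:
# direct rows of the outer tbody only; nested tables are skipped entirely
first_col, second_col = [], []
for row in soup.table.tbody.find_all('tr', recursive=False):
    tds = row.find_all('td', recursive=False)  # direct cells of this row
    first_col.append(tds[0].get_text(strip=True))
    second_col.append(tds[1].get_text(strip=True))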
I'm trying to scrape some documentation for data files composed in XML. I had been writing the XSD manually by reading the pages and typing it out, but it has occurred to me that this is a prime case for page scraping. The general format (as far as I can tell based on a random sample) is something like the following:
<h2>
<span class="mw-headline" id="The_.22show.22_Child_Element">
The "show" Child Element
</span>
</h2>
<dl>
<dd>
<table class="infotable">
<tr>
<td class="leftnormal">
allowstack
</td>
<td>
(Optional) Boolean – Indicates whether the user is allowed to stack items within the table, subject to the restrictions imposed for each item. Default: "yes".
</td>
</tr>
<tr>
<td>
agentlist
</td>
<td>
(Optional) Id – If set to a tag group id, only picks with the agent pick's identity tag from that group are shown in the table. Default: None.
</td>
</tr>
<tr>
<td>
allowmove
</td>
<td>
(Optional) Boolean – Indicates whether picks in this table can be moved out of this table, if the user drags them around. Default: "yes".
</td>
</tr>
<tr>
<td>
listpick
</td>
<td>
(Optional) Id – Unique id of the pick to take the table's list expression from (see listfield, below). Note that this does not work when used with portals. Default: None.
</td>
</tr>
<tr>
<td>
listfield
</td>
<td>
(Optional) Id – Unique id of the field to take the table's list expression from (see listpick, above). Note that this does not work when used with portals. Default: None.
</td>
</tr>
</table>
</dd>
</dl>
<p>
The "show" element also possesses child elements that define additional behaviors of the table. The list of these child elements is below and must appear in the order shown. Click on the link to access the details for each element.
</p>
<dl>
<dd>
<table class="infotable">
<tr>
<td class="leftnormal">
<a href="index.php5#title=TableDef_Element_(Data).html#list">
list
</a>
</td>
<td>
An optional "list" element may appear as defined by the given link. This element defines a
<a href="index.php5#title=List_Tag_Expression.html" title="List Tag Expression">
List Tag Expression
</a>
for the table.
</td>
</tr>
</table>
</dd>
</dl>
There's a pretty clear pattern of each file having a number of elements defined by a header followed by text followed by a table (generally the attributes), and possibly another set of text and a table (for the child elements). I think I can reach a reasonable solution by simply using next or next-sibling to step through items and trying to scan the text to determine if the following table is attributes or classes, but it feels a bit weird that I can't just grab everything in between two header tags and then scan that.
You can search for multiple elements at the same time, for example <h2> and <table>. You can then make a note of each <h2>'s contents before processing each <table>.
For example:
soup = BeautifulSoup(html, "html.parser")
for el in soup.find_all(['h2', 'table']):
    if el.name == 'h2':
        h2 = el.get_text(strip=True)
        h2_id = el.span['id']
    else:
        for tr in el.find_all('tr'):
            row = [td.get_text(strip=True) for td in tr.find_all('td')]
            print([h2, h2_id, *row])
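If you would rather grab everything between two header tags and scan that, as the question suggests, a sketch along these lines (my addition, not from the answer above) collects each <h2>'s following siblings until the next <h2>:
# slice the document into per-heading sections, then pull the tables out
for h2 in soup.find_all('h2'):
    section = []
    for sib in h2.find_next_siblings():
        if sib.name == 'h2':
            break
        section.append(sib)
    # tables for this heading (here they sit inside <dl>/<dd> wrappers)
    tables = [t for el in section for t in el.find_all('table')]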
I am trying to scrape some data off a website. The data that I want is listed in a table, but there are multiple tables and no IDs. I then had the idea that I would find the header just above the table I was searching for and use that as an indicator.
This has really troubled me, so as a last resort, I wanted to ask if there is someone who knows how to use BeautifulSoup to find the table.
A snippet of the HTML code is provided beneath, thanks in advance :)
The table I am interested in is the table right beneath <h2>Mine neaste vagter</h2>
<h2>Min aktuelle vagt</h2>
<div>
<a href='/shifts/detail/595212/'>Flere detaljer</a>
<p>Vagt starter: <b>11/06 2021 - 07:00</b></p>
<p>Vagt slutter: <b>11/06 2021 - 11:00</b></p>
<h2>Masker</h2>
<table class='list'>
<tr><th>Type</th><th>Fra</th><th> </th><th>Til</th></tr>
<tr>
<td>Fri egen regningD</td>
<td>07:00</td>
<td> - </td>
<td>11:00</td>
</tr>
</table>
</div>
<hr>
<h2>Mine neaste vagter</h2>
<table class='list'>
<tr>
<th class="alignleft">Dato</th>
<th class="alignleft">Rolle</th>
<th class="alignleft">Tidsrum</th>
<th></th>
<th class="alignleft">Bytte</th>
<th class="alignleft" colspan='2'></th>
</tr>
<tr class="rowA separator">
<td>
<h3>12/6</h3>
</td>
<td>Kundeservice</td>
<td>18:00 → 21:30 (3.5 t)</td>
<td style="max-width: 20em;"></td>
<td>
<a href="/shifts/ajax/popup/595390/" class="swap shiftpop">
Byt denne vagt
</a>
</td>
<td><a href="/shifts/detail/595390/">Detaljer</td>
<td>
</td>
</tr>
Here are two approaches to find the correct <table>:
Since the table you want is the last one in the HTML, you can use find_all() and index [-1] to get the last table:
print(soup.find_all("table", class_="list")[-1])
Find the <h2> element by its text, and then use the find_next() method to find the table:
print(soup.find(lambda tag: tag.name == "h2" and "Mine neaste vagter" in tag.text).find_next("table"))
You can use :-soup-contains (or just :contains) to target the <h2> by its text and then use find_next to move to the table:
from bs4 import BeautifulSoup as bs
html = '''your html'''
soup = bs(html, 'lxml')
soup.select_one('h2:-soup-contains("Mine neaste vagter")').find_next('table')
This is assuming the HTML, as shown, is returned by whatever access method you are using.
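To then pull the shift rows out of that table, a small follow-up sketch (my addition, assuming the header row comes first):
table = soup.select_one('h2:-soup-contains("Mine neaste vagter")').find_next('table')
for tr in table.find_all('tr')[1:]:  # [0] is the <th> header row
    print([td.get_text(strip=True) for td in tr.find_all('td')])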
To make a long story short: I just discovered BeautifulSoup yesterday, I haven't done scripting or coding of any sort for many years, I'm under a time crunch, and I'm begging for help. :)
My end goal is scraping a series of web pages with vertical-style data tables and dropping them to CSV. With ye olde Google, along with my first post on Stack Overflow earlier today (at least the first in a decade or more), I got the basics down. I can input a text file with the list of URLs, identify the DIV that contains the table I need, scrape the table so that the first column becomes my header and the second becomes the data row, and repeat for the next URLs (without repeating the header).
The snag I've hit is that the code of these pages is far worse than I thought, including a ton of extra lines, extra spaces, and, as I'm now finding, nested <span> tags inside the <td> tags, most of which are empty. But between the spans and the extra lines, the script I have so far ignores some of the data inside the td. For an example of the hideous page code:
<div id="One" class="collapse show" aria-labelledby="headingOne" data-parent="#accordionExample">
<div class="card-body">
<table class="table table-borderless">
<tbody>
<tr>
<td>ID:</td>
<td>
096626 180012
</td>
</tr>
<tr>
<td>Address:</td>
<td>
1234 Main St
</td>
</tr>
<tr>
<td>Addr City:</td>
<td>
City
</td>
</tr>
<tr>
<td> Name :</td>
<td>
Last name, first name<span> </span>
</td>
</tr>
<tr>
<td>In Care Of Address:</td>
<td>
1234<span> </span>
<span> </span>
Main<span> </span>
St <span> </span>
<span> </span>
<span> </span>
</td>
</tr>
<tr>
<td>City/State/Zip:</td>
<td>
City<span> </span>
ST<span> </span>
Zip<span>-</span>
Zip+4
</td>
</tr>
</tbody>
</table>
</div>
</div>
The code I have so far is below. (Right now, the URL text file has the name of a locally stored HTML file like the one above, but I have tested with the actual URLs to verify that part works.)
from bs4 import BeautifulSoup
import requests
import csv
import pandas as pd
contents = []
headers = []
values = []
rows = []
num = 0
with open('sampleurls.txt','r') as csvf:
    urls = csv.reader(csvf)
    for url in urls:
        contents.append(url)

for url in contents:
    html = open(url[0]).read()
    soup = BeautifulSoup(html, 'html.parser')
    trs = soup.select('div#One tr')
    for t in trs:
        for header, value in zip(t.select('td')[0], t.select('td')[1]):
            if num == 0:
                headers.append(' '.join(header.split()))
            values.append(' '.join(value.split()))
    rows.append(values)
    values = []
    num += 1

df = pd.DataFrame(rows, columns=headers)
print(df.head())
df.to_csv('output5.csv')
When executed, the script seems to ignore anything that comes after a newline or a span (not sure which). The output I get is:
,ID:,Address:,Addr City:,Name :,In Care Of Address:,City/State/Zip:
0,096626 180012,1234 Main St,City,"Last name, first name",1234,City
In the "In Care Of Address:" column, instead of getting "1234 Main St", I just get "1234". I also tried this without the join/split function, and the remaining part of the address is still ignored. Is there a way around this? In theory, I don't need any data inside the spans as the only one populated is the hypen in the zip+4, which I don't care about.
Side note, I'm assuming the first column in the output is part of the CSV writing function, but if there's a way to get rid of it I'd like to. Not huge as I can ignore that when I import the CSV into my database, but the cleaner the better.
It is easier with the correct info in the post in the first place. :)
Try:
from bs4 import BeautifulSoup
import pandas as pd
html = '''
<html>
<body>
<div aria-labelledby="headingOne" class="collapse show" data-parent="#accordionExample" id="One">
<div class="card-body">
<table class="table table-borderless">
<tbody>
<tr>
<td>
ID:
</td>
<td>
096626 180012
</td>
</tr>
<tr>
<td>
Address:
</td>
<td>
1234 Main St
</td>
</tr>
<tr>
<td>
Addr City:
</td>
<td>
City
</td>
</tr>
<tr>
<td>
Name :
</td>
<td>
Last name, first name
<span>
</span>
</td>
</tr>
<tr>
<td>
In Care Of Address:
</td>
<td>
1234
<span>
</span>
<span>
</span>
Main
<span>
</span>
St
<span>
</span>
<span>
</span>
<span>
</span>
</td>
</tr>
<tr>
<td>
City/State/Zip:
</td>
<td>
City
<span>
</span>
ST
<span>
</span>
Zip
<span>
-
</span>
Zip+4
</td>
</tr>
</tbody>
</table>
</div>
</div>
</body>
</html>
'''
contents = []
headers = []
values = []
rows = []
num = 0
soup = BeautifulSoup(html, 'html.parser')
trs = soup.select('div#One tr')
for t in trs:
    for header, value in zip(t.select('td')[0], t.select('td:nth-child(2)')):
        if num == 0:
            headers.append(' '.join(header.split()))
        values.append(value.get_text(' ', strip=True))
rows.append(values)

df = pd.DataFrame(rows, columns=headers)
print(df.head())
df.to_csv('output5.csv')
Hope it works now with the more relevant info.
The csv now shows the full values, e.g. "1234 Main St" in the "In Care Of Address:" column.
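As for the side note about the extra first column: that column is the pandas DataFrame index, not something BeautifulSoup adds. Passing index=False to to_csv drops it (standard pandas, not specific to this page):
df.to_csv('output5.csv', index=False)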
I'm trying to scrape a saved HTML page of results, copy the entries for each one, and iterate through the document. However, I can't figure out how to narrow down the element to start from. The data I want to grab is in the "td" tags below each of the following "tr" tags:
<tr bgcolor="#d7d7d7">
<td valign="top" nowrap="">
Submittal<br>20190919-5000
<!-- ParentAccession= -->
<br>
</td>
<td valign="top">
09/18/2019<br>
09/19/2019
</td>
<td valign="top" nowrap="">
ER19-2760-000<br>ER19-2762-000<br>ER19-2763-000<br>ER19-2764-000<br>ER19-2765-000<br>ER19-2766-000<br>ER19-2768-000<br><br>
</td>
<td valign="top">
(doc-less) Motion to Intervene of Snohomish County Public Utility District No. 1 under ER19-2760, et. al..<br>Availability: Public<br>
</td>
<td valign="top">
<classtype>Intervention /<br> Motion/Notice of Intervention</classtype>
</td>
<td valign="top">
<table valign="top">
<input type="HIDDEN" name="ext" value="TXT"><tbody><tr><td valign="top"> <input type="checkbox" name="subcheck" value="V:14800341:12904817:15359058:TXT"></td><td> Text</td><td> & nbsp; 0K</td></tr><input type="HIDDEN" name="ext" value="PDF"><tr><td valign="top"> <input type="checkbox" name="subcheck" value="V:14800341:12904822:15359063:PDF"></td><td> FERC Generated PDF</td><td> 11K</td></tr>
</tbody></table>
</td>
The next tag is <tr bgcolor="White">, with the same structure as the one above. These alternate so the results appear in different colors on the results page.
I need to go through all of the subsequent td tags and grab the data, but they aren't differentiated by a class or anything I can zero in on. The code I wrote grabs the entire text content of the td tags and appends it, but I need to treat each td tag as a separate item, and then do the same for the next entry, and so on.
By setting the td[0] value I start at the first td tag, but I don't think this is the correct approach.
from bs4 import BeautifulSoup
import urllib
import re

soup = BeautifulSoup(open("/Users/Desktop/FERC/uploads/ferris_9-19-2019-9-19-2019.electric.submittal.html"), "html.parser")
data = []
for td in soup.findAll(bgcolor=["#d7d7d7", "White"]):
    values = [td[0].text.strip() for td in td.findAll('td')]
    data.append(values)
print(data)
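One possible direction (a sketch of my own, not an answer from the thread): match each result row by its alternating bgcolor, then collect only that row's direct td children as separate items, so the nested download table's cells don't become items of their own:
# each result row is a <tr> with bgcolor "#d7d7d7" or "White";
# recursive=False keeps the nested <table>'s cells out of the list
data = []
for row in soup.find_all('tr', bgcolor=["#d7d7d7", "White"]):
    cells = [td.get_text(' ', strip=True)
             for td in row.find_all('td', recursive=False)]
    data.append(cells)
print(data)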