Working with broken HTML + BeautifulSoup - python

I have some wonderfully broken HTML that, long story short, is preventing me from using the normal nested <table>, <tr>, <td> structure that would make it easy to reconstruct tables.
Here's a snippet with line numbers for reference:
1 <td valign="top"> <!-- closing </td> should be on 6 -->
2 <font face="arial" size="1">
3 <center>
4 06-30-95
5 </center>
6 <tr valign="top">
7 <td>
8 <center>
9 <font ,="" arial,="" face="arial" sans="" serif"="" size="1">
10 1382
11 <p>
12 (23)
13 </p>
14 </font>
15 </center>
16 </td>
17 <td>
18 <font ,="" arial,="" face="arial" sans="" serif"="" size="1">
19 <center>
20 06-18-14
21 </center>
22 </font>
23 </td>
24 </tr>
25 </td> <!-- this should be on 6 -->
The nesting of trs within tds within trs has no scheme to it whatsoever, and is coupled with unclosed tags to boot. The HTML tree in no way resembles how it is structurally rendered. (In this case, I suppose there are technically no missing closing tags, but the actual rendering of the page makes it clear there should be no nested tds.)
However, playing by the following set of rules would work in this case:
For any <td> that is followed by an opening <td> before its closing </td> (i.e. any nested td), assume that the latter opening <td> (line 7) serves as closure for the first (line 1);
Otherwise, just grab (open, close) <td> ... </td> tags as usual, where the opener and closer have no <td> in between them (an example would be lines 17 & 23 above).
Desired result here would be something like:
['06-30-95', '1382\n(23)', '06-18-14']
How can this be addressed in BeautifulSoup? I would show an attempt, but have picked through the docs and some of the source and not found much at all.
Currently this would parse to:
html = """
<td valign="top">
<font face="arial" size="1">
<center>
06-30-95
</center>
<tr valign="top">
<td>
<center>
<font ,="" arial,="" face="arial" sans="" serif"="" size="1">
1382
<p>
(23)
</p>
</font>
</center>
</td>
<td>
<font ,="" arial,="" face="arial" sans="" serif"="" size="1">
<center>
06-18-14
</center>
</font>
</td>
</tr>
</td>
"""
from bs4 import BeautifulSoup, SoupStrainer
strainer = SoupStrainer('td')
soup = BeautifulSoup(html, 'html.parser', parse_only=strainer)
[tag.text.replace('\n', '') for tag in soup.find_all('td')]
[' 06-30-95 1382 (23) 06-18-14 ',
' 1382 (23) ',
' 06-18-14 ']
And my issue with that result is not the whitespace; it's the repetition of substrings. It almost seems like I'd need to recursively work upwards from the innermost tags, popping off each and working outwards. But I have to guess there's more built-in functionality for dealing with missing closing tags (handle_endtag stands out from the BeautifulSoup constructor?).

For wonderfully broken HTML, there are two ways you can go about this. The first is to find the most consistent set of opened/closed tags at the innermost possible nesting level, and only make use of the first one in each cell. In the limited example provided, it looks like the <center> tags satisfy this. Consider the following:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html, 'html.parser')
>>> [t.find('center').text.strip() for t in soup.find_all('td')]
['06-30-95', '1382\n \n (23)', '06-18-14']
Alternatively, using lxml as the parser (the documentation lists it as one of the supported parsers) may actually work better overall:
>>> soup2 = BeautifulSoup(html, 'lxml')
>>> [t.text.strip() for t in soup2.find_all('td')]
['06-30-95', '1382\n \n (23)', '06-18-14']
There are other methods that are covered in this thread: Fast and effective way to parse broken HTML?

Try this (content is assumed to hold the HTML string from the question). It should give you the output you asked for:
from bs4 import BeautifulSoup
soup = BeautifulSoup(content, 'html5lib')
item = [' '.join(items.text.split()) for items in soup.select("center")]
print(item)
Output:
['06-30-95', '1382 (23)', '06-18-14']
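The rule from the question (a nested <td> acts as the closer of the <td> it sits in) can also be approximated without switching parsers. A minimal sketch, assuming the tree that html.parser builds for the snippet: for each td, keep only the strings whose nearest td ancestor is that td itself. The helper name own_text is made up for this illustration, and the sample HTML is a trimmed copy of the question's snippet with the odd font attributes removed.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the snippet from the question (font attributes simplified).
html = """
<td valign="top">
<font face="arial" size="1">
<center>
06-30-95
</center>
<tr valign="top">
<td>
<center>
<font face="arial" size="1">
1382
<p>
(23)
</p>
</font>
</center>
</td>
<td>
<font face="arial" size="1">
<center>
06-18-14
</center>
</font>
</td>
</tr>
</td>
"""

soup = BeautifulSoup(html, "html.parser")


def own_text(td):
    """Text of `td`, skipping anything that sits inside a nested <td>."""
    # A string belongs to this cell only if its closest <td> ancestor is this cell.
    parts = [s for s in td.find_all(string=True) if s.find_parent("td") is td]
    # Collapse all runs of whitespace (including newlines) into single spaces.
    return " ".join(" ".join(parts).split())


cells = [own_text(td) for td in soup.find_all("td")]
print(cells)
```

This collapses the newline between 1382 and (23) into a space; keep the raw parts if the line break matters. HTML comments would also be picked up by find_all(string=True), so pages that contain them inside cells would need an extra filter.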

Related

Scraping Multi-Row TD With Nested SPAN tags

For the long story short part, just discovered BeautifulSoup yesterday, haven't done scripting or coding of any sort for many years, under time crunch, begging for help. :)
My end goal is scraping a series of web pages with vertical style data tables and dropping to CSV. With ye olde Google, along with my first post on stack overflow earlier today (at least first time in a decade or more), I got the basics down. I can input a text file with the list of URLs, identify the DIV that contains the table I need, scrape the table so that the first column becomes my header, and second becomes the data row, and repeat for next URLs (without repeating header). The snag I've hit is that the code of these pages is far worse than I thought, including a ton of extra lines, extra spaces, and now as I'm finding, nested tags inside the tags, most of which are empty. But, between the spans and the extra lines, it causes the script I have so far to ignore some of the data inside the TD. For an example of the hideous page code:
<div id="One" class="collapse show" aria-labelledby="headingOne" data-parent="#accordionExample">
<div class="card-body">
<table class="table table-borderless">
<tbody>
<tr>
<td>ID:</td>
<td>
096626 180012
</td>
</tr>
<tr>
<td>Address:</td>
<td>
1234 Main St
</td>
</tr>
<tr>
<td>Addr City:</td>
<td>
City
</td>
</tr>
<tr>
<td> Name :</td>
<td>
Last name, first name<span> </span>
</td>
</tr>
<tr>
<td>In Care Of Address:</td>
<td>
1234<span> </span>
<span> </span>
Main<span> </span>
St <span> </span>
<span> </span>
<span> </span>
</td>
</tr>
<tr>
<td>City/State/Zip:</td>
<td>
City<span> </span>
ST<span> </span>
Zip<span>-</span>
Zip+4
</td>
</tr>
</tbody>
</table>
</div>
</div>
The code I have so far is (right now, the url text file has the name of a locally stored HTML file as above, but have tested with the actual URLs to verify that part works):
from bs4 import BeautifulSoup
import requests
import csv
import pandas as pd
contents = []
headers = []
values = []
rows = []
num = 0
with open('sampleurls.txt','r') as csvf:
    urls = csv.reader(csvf)
    for url in urls:
        contents.append(url)
for url in contents:
    html = open(url[0]).read()
    soup = BeautifulSoup(html, 'html.parser')
    trs = soup.select('div#One tr')
    for t in trs:
        for header, value in zip(t.select('td')[0], t.select('td')[1]):
            if num == 0:
                headers.append(' '.join(header.split()))
            values.append(' '.join(value.split()))
    rows.append(values)
    values = []
    num += 1
df = pd.DataFrame(rows, columns= headers)
print(df.head())
df.to_csv('output5.csv')
When executed, the script seems to ignore anything that comes after a newline, or span, not sure which. The output I get is:
,ID:,Address:,Addr City:,Name :,In Care Of Address:,City/State/Zip:
0,096626 180012,1234 Main St,City,"Last name, first name",1234,City
In the "In Care Of Address:" column, instead of getting "1234 Main St", I just get "1234". I also tried this without the join/split function, and the remaining part of the address is still ignored. Is there a way around this? In theory, I don't need any data inside the spans, as the only one populated is the hyphen in the zip+4, which I don't care about.
Side note, I'm assuming the first column in the output is part of the CSV writing function, but if there's a way to get rid of it I'd like to. Not huge as I can ignore that when I import the CSV into my database, but the cleaner the better.
It is easier when the post has the correct info in the first place.. :)
Try:
from bs4 import BeautifulSoup
import pandas as pd
html = '''
<html>
<body>
<div aria-labelledby="headingOne" class="collapse show" data-parent="#accordionExample" id="One">
<div class="card-body">
<table class="table table-borderless">
<tbody>
<tr>
<td>
ID:
</td>
<td>
096626 180012
</td>
</tr>
<tr>
<td>
Address:
</td>
<td>
1234 Main St
</td>
</tr>
<tr>
<td>
Addr City:
</td>
<td>
City
</td>
</tr>
<tr>
<td>
Name :
</td>
<td>
Last name, first name
<span>
</span>
</td>
</tr>
<tr>
<td>
In Care Of Address:
</td>
<td>
1234
<span>
</span>
<span>
</span>
Main
<span>
</span>
St
<span>
</span>
<span>
</span>
<span>
</span>
</td>
</tr>
<tr>
<td>
City/State/Zip:
</td>
<td>
City
<span>
</span>
ST
<span>
</span>
Zip
<span>
-
</span>
Zip+4
</td>
</tr>
</tbody>
</table>
</div>
</div>
</body>
</html>
'''
contents = []
headers = []
values = []
rows = []
num = 0
soup = BeautifulSoup(html, 'html.parser')
trs = soup.select('div#One tr')
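for t in trs:
    # first td holds the header text; second td holds the value, possibly split up by spans
    for header, value in zip(t.select('td')[0], t.select('td:nth-child(2)')):
        if num == 0:
            headers.append(' '.join(header.split()))
        # get_text joins every text fragment in the cell, so text after a span is kept
        values.append(value.get_text(' ', strip=True))
rows.append(values)
df = pd.DataFrame(rows, columns= headers)
print(df.head())
# index=False leaves out the unnamed first (index) column in the CSV
df.to_csv('output5.csv', index=False)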
Hope it works now with more relevant info.
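For anyone wondering why the original loop stopped at "1234": iterating over a Tag yields its direct children, and zip stops as soon as the shorter iterable runs out. The header td has a single child, so only the first child of the value td (the text before the first span) was ever read. A small sketch of that behaviour, using a cut-down row made up from the sample above:

```python
from bs4 import BeautifulSoup

# Cut-down version of the "In Care Of Address" row from the question.
html = """
<table><tbody>
<tr>
<td>In Care Of Address:</td>
<td>
1234<span> </span>
<span> </span>
Main<span> </span>
St <span> </span>
</td>
</tr>
</tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")
header_td, value_td = soup.find("tr").find_all("td")

# Iterating a Tag walks its direct children; the first child here is
# only the text that precedes the first <span>.
first_child = next(iter(value_td))
truncated = " ".join(first_child.split())

# get_text() gathers every text fragment in the cell; with strip=True the
# whitespace-only spans contribute nothing.
full = value_td.get_text(" ", strip=True)

print(repr(truncated))
print(repr(full))
```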

How to extract data with beautifulsoup with similar attributes

I'm trying to scrape a saved html page of results and copy the entries for each and iterate through the document. However I can't figure out how to narrow down the element to start. The data I want to grab is in the "td" tags below each of the following "tr" tags:
<tr bgcolor="#d7d7d7">
<td valign="top" nowrap="">
Submittal<br>20190919-5000
<!-- ParentAccession= -->
<br>
</td>
<td valign="top">
09/18/2019<br>
09/19/2019
</td>
<td valign="top" nowrap="">
ER19-2760-000<br>ER19-2762-000<br>ER19-2763-000<br>ER19-2764-000<br>ER19-2765-000<br>ER19-2766-000<br>ER19-2768-000<br><br>
</td>
<td valign="top">
(doc-less) Motion to Intervene of Snohomish County Public Utility District No. 1 under ER19-2760, et. al..<br>Availability: Public<br>
</td>
<td valign="top">
<classtype>Intervention /<br> Motion/Notice of Intervention</classtype>
</td>
<td valign="top">
<table valign="top">
<input type="HIDDEN" name="ext" value="TXT"><tbody><tr><td valign="top"> <input type="checkbox" name="subcheck" value="V:14800341:12904817:15359058:TXT"></td><td> Text</td><td> &nbsp; 0K</td></tr><input type="HIDDEN" name="ext" value="PDF"><tr><td valign="top"> <input type="checkbox" name="subcheck" value="V:14800341:12904822:15359063:PDF"></td><td> FERC Generated PDF</td><td> 11K</td></tr>
</tbody></table>
</td>
The next tag is <tr bgcolor="White">, with the same structure as the one above. These alternate so the results are in different colors on the results page.
I need to go through all of the subsequent td tags and grab the data but they aren't differentiated by a class or anything I can zero in on. The code I wrote grabs the entire contents of the td tags text and appends it but I need to treat each td tag as a separate item and then do the same for the next entry etc.
By setting the td[0] value I start at the first td tag but I don't think this is the correct approach.
from bs4 import BeautifulSoup
import urllib
import re
soup = BeautifulSoup(open("/Users/Desktop/FERC/uploads/ferris_9-19-2019-9-19-2019.electric.submittal.html"), "html.parser")
data = []
for td in soup.findAll(bgcolor=["#d7d7d7", "White"]):
    values = [td[0].text.strip() for td in td.findAll('td')]
    data.append(values)
print(data)
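One possible way to treat each td as a separate item, sketched under the assumption that every result is a tr with one of the two bgcolor values and that its cells are direct children of that tr: loop over the rows, then take only the direct-child td tags with recursive=False so the cells of the nested file table are not counted twice. The sample HTML below is a shortened, made-up stand-in for the saved page. Note that td[0] in the original looks up an attribute named 0 on the tag rather than indexing a list, so it is dropped here.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for two result rows from the saved page.
html = """
<table>
<tr bgcolor="#d7d7d7">
<td valign="top" nowrap="">Submittal<br>20190919-5000</td>
<td valign="top">09/18/2019<br>09/19/2019</td>
<td valign="top"><table><tr><td>Text</td><td>0K</td></tr></table></td>
</tr>
<tr bgcolor="White">
<td valign="top" nowrap="">Submittal<br>20190919-5001</td>
<td valign="top">09/18/2019<br>09/19/2019</td>
<td valign="top"><table><tr><td>PDF</td><td>11K</td></tr></table></td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

data = []
for tr in soup.find_all("tr", bgcolor=["#d7d7d7", "White"]):
    # recursive=False keeps only the cells of this row, not those of nested tables
    cells = tr.find_all("td", recursive=False)
    data.append([td.get_text(" ", strip=True) for td in cells])

for entry in data:
    print(entry)
```

Each inner list is one entry, with one string per column. If a different parser inserts extra wrapper tags between the tr and its cells, the recursive=False lookup would need adjusting.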

bs4 parent attrs python

I'm just starting to code in Python, and a friend asked me for an application that finds specific data on the web and presents it nicely.
I have already found a suitable website that contains the data and I can extract the basic info, but the challenge is to go deeper.
Using BS4 in Python 3.4, I have reached markup like this example:
<tr class=" " somethingc1="" somethingc2="" somethingc3="" data-something="1" something="1something6" something_id="6something0">
<td class="text-center td_something">
<div>
Super String of Something
</div>
</td>
<td class="text-center">08/26 15:00</td>
<td class="text-center something_status">
<span class="something_status_something">Full</span>
</td>
</tr>
<tr class=" " somethingc1="" somethingc2="" somethingc3="" data-something="0" something="1something4" something_id="6something7">
<td class="text-center td_something">
<div>
Super String of Something
</div>
</td>
<td class="text-center">05/26 15:00</td>
<td class="text-center something_status">
<span class="something_status_something"></span>
</td>
</tr>
What I want to do now is find the date string, but only if the parent has data-something="1" and not if it has data-something="0".
I can scrape all dates with:
soup.find_all(lambda tag: tag.name == 'td' and tag.get('class') == ['text-center'] and not tag.has_attr('style'))
but it does not check parent. That is why I tried:
def KieMeWar(tag):
    return tag.name == 'td' and tag.parent.name == 'tr' and tag.parent.attrs == {"data-something": "1"} #and tag.get('class') == ['text-center'] and not tag.has_attr('style')
soup.find_all(KieMeWar)
The result is an empty set. What is wrong, or what is the easiest way to reach the target I am aiming for?
P.S. This is an example excerpt of the full code, which is why I check for the absence of style: it does not appear here, but it does later.
BeautifulSoup's findAll has the attrs kwarg, which is used to find tags with a given attribute:
import bs4
soup = bs4.BeautifulSoup(html, 'html.parser')
trs = soup.findAll('tr', attrs={'data-something':'1'})
That finds all tr tags with the attribute data-something="1". Afterwards, you can loop through the trs and grab the 2nd td tag to extract the date:
for t in trs:
    print(str(t.findAll('td')[1].text))
>>> 08/26 15:00
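As for why the function in the question returned nothing: tag.parent.attrs is the dict of all attributes on the tr, so comparing it with == to a one-entry dict fails whenever the row carries any other attribute (class, something_id and so on). Checking just the one key with .get() keeps the function-based filter working. A minimal sketch, with a shortened sample and a made-up function name:

```python
from bs4 import BeautifulSoup

# Shortened sample based on the markup in the question.
html = """
<table>
<tr class=" " data-something="1" something_id="6something0">
<td class="text-center td_something"><div>Super String of Something</div></td>
<td class="text-center">08/26 15:00</td>
</tr>
<tr class=" " data-something="0" something_id="6something7">
<td class="text-center td_something"><div>Super String of Something</div></td>
<td class="text-center">05/26 15:00</td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")


def date_cell_in_flagged_row(tag):
    """True for a plain text-center <td> whose parent <tr> has data-something="1"."""
    return (
        tag.name == "td"
        and tag.get("class") == ["text-center"]
        and not tag.has_attr("style")
        and tag.parent is not None
        and tag.parent.name == "tr"
        # look at one attribute instead of comparing the whole attrs dict
        and tag.parent.get("data-something") == "1"
    )


dates = [td.get_text(strip=True) for td in soup.find_all(date_cell_in_flagged_row)]
print(dates)
```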

Beautiful Soup: extracting tagged and untagged HTML text

As a novice with bs4 I'm looking for some help in working out how to extract the text from a series of webpage tables, one of which is like this:
<table style="padding:0px; margin:1px" width="715px">
<tr>
<td height="22" width="33%" >
<span class="darkGreenText"><strong> Name: </strong></span>
Tyto alba
</td>
<td height="22" width="33%" >
<span class="darkGreenText"><strong> Order: </strong></span>
Strigiformes
</td>
<td height="22" width="33%">
<span class="darkGreenText"><strong> Family: </strong></span>
Tytonidae
</td>
<td height="22" width="66%" colspan="2">
<span class="darkGreenText"><strong> Status: </strong></span>
Least Concern
</td>
</tr>
</table>
Desired output:
Name: Tyto alba
Order: Strigiformes
Family: Tytonidae
Status: Least Concern
I've tried using [index] as recommended (https://stackoverflow.com/a/35050622/1726290),
and also next_sibling (https://stackoverflow.com/a/23380225/1726290) but I'm getting stuck as one part of the text I need is tagged and the second part is not. Any help would be appreciated.
It seems like what you want is to call get_text(strip=True) on the BeautifulSoup Tag. Assuming raw_html is the HTML you pasted above:
htmlSoup = BeautifulSoup(raw_html, 'html.parser')
for tag in htmlSoup.select('td'):
    print(tag.get_text(strip=True))
which prints:
Name:Tyto alba
Order:Strigiformes
Family:Tytonidae
Status:Least Concern
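The output above has no space after the colon, while the desired output does. get_text also accepts a separator as its first argument, which is placed between the stripped text fragments. A short sketch on a two-cell excerpt of the table:

```python
from bs4 import BeautifulSoup

# Two cells taken from the table in the question.
raw_html = """
<table style="padding:0px; margin:1px" width="715px">
<tr>
<td height="22" width="33%" >
<span class="darkGreenText"><strong> Name: </strong></span>
Tyto alba
</td>
<td height="22" width="33%" >
<span class="darkGreenText"><strong> Order: </strong></span>
Strigiformes
</td>
</tr>
</table>
"""

soup = BeautifulSoup(raw_html, "html.parser")

# " " is the separator inserted between the tagged label and the untagged value
lines = [td.get_text(" ", strip=True) for td in soup.find_all("td")]
for line in lines:
    print(line)
```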

Python BeautifulSoup parsing specific text

I am parsing an html file and I want to find the part of the file where it says "Smaller Reporting Company" and either has an "X" or Checkbox next to it or doesn't. The checkbox is typically done with the Wingdings font or an ascii code. In the HTML below you'll see it has an þ in wingdings next to it.
I have no problem showing the results of a regular expression search for the text, but I'm having trouble going the next step and looking for a check box.
I will be using this to parse a number of different html files that won't all follow the same format, but most of them will use a table and ascii text like this example.
Here is the HTML code:
<HTML>
<HEAD><TITLE></TITLE></HEAD>
<BODY>
<DIV align="left">Indicate by check mark whether the registrant is a large accelerated filer, an accelerated filer, a non-accelerated filer, or a smaller reporting company. See the definitions of “large accelerated filer,” “accelerated filer” and “smaller reporting company”. (Check one):
</DIV>
<DIV align="center">
<TABLE style="font-size: 10pt" cellspacing="0" border="0" cellpadding="0" width="100%">
<!-- Begin Table Head -->
<TR valign="bottom">
<TD width="22%"> </TD>
<TD width="3%"> </TD>
<TD width="22%"> </TD>
<TD width="3%"> </TD>
<TD width="22%"> </TD>
<TD width="3%"> </TD>
<TD width="22%"> </TD>
</TR>
<TR></TR>
<!-- End Table Head -->
<!-- Begin Table Body -->
<TR valign="bottom">
<TD align="center" valign="top"><FONT style="white-space: nowrap"> Large accelerated filer <FONT style="font-family: Wingdings">o</FONT></FONT>
</TD>
<TD> </TD>
<TD align="center" valign="top"><FONT style="white-space: nowrap">Accelerated filer <FONT style="font-family: Wingdings">o</FONT></FONT>
</TD>
<TD> </TD>
<TD align="center" valign="top"><FONT style="white-space: nowrap"> Non-accelerated filer <FONT style="font-family: Wingdings">o</FONT> </FONT>
<FONT style="white-space: nowrap">(Do not check if a smaller reporting company)</FONT>
</TD>
<TD> </TD>
<TD align="center" valign="top"><FONT style="white-space: nowrap"> Smaller reporting company <FONT style="font-family: Wingdings">þ</FONT></FONT></TD>
</TR>
<!-- End Table Body -->
</TABLE>
</DIV></BODY></HTML>
Here is my Python code:
import os, sys, string, re
from BeautifulSoup import BeautifulSoup
rawDataFile = "testfile1.html"
f = open(rawDataFile)
soup = BeautifulSoup(f)
f.close()
search = soup.findAll(text=re.compile('[sS]maller.*[rR]eporting.*[cC]ompany'))
print search
Question:
How could I set this up to have a second search that is dependent upon the first search? So when I find "smaller reporting company" I can search the next few lines to see if there is an ascii code? I've been going through the soup docs. I tried to do find and findNext but I haven't been able to get it to work.
If you know the position of the wingding character won't change, you can use .next.
>>> nodes = soup.findAll(text=re.compile('[sS]maller.*[rR]eporting.*[cC]ompany'))
>>> nodes[-1].next.next # last item in list is the only good one... kinda crap
u'þ'
Or you can go up, and then find from there:
>>> nodes[-1].parent.find('font',style="font-family: Wingdings").next
u'þ'
Or you could do it the other way round:
>>> soup.findAll(text='þ')[0].previous.previous
u' Smaller reporting company '
This assumes that you know the wingding characters you're looking for.
The last strategy has the added bonus of filtering out other crap that your regex is catching, which I suppose you don't really want; you can then just cycle through the results knowing that you're only working on the right list, so you can peruse it to your liking.
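The snippets above use the old BeautifulSoup 3 API under Python 2. A rough equivalent of the "go up, then find from there" strategy with bs4 on Python 3 might look like the sketch below. The CHECKED set is an assumption about which characters count as a ticked box in these filings, and the sample is a two-cell excerpt of the table from the question.

```python
import re
from bs4 import BeautifulSoup

# Two cells from the table in the question.
html = """
<TABLE><TR valign="bottom">
<TD align="center" valign="top"><FONT style="white-space: nowrap"> Non-accelerated filer <FONT style="font-family: Wingdings">o</FONT> </FONT>
<FONT style="white-space: nowrap">(Do not check if a smaller reporting company)</FONT>
</TD>
<TD align="center" valign="top"><FONT style="white-space: nowrap"> Smaller reporting company <FONT style="font-family: Wingdings">þ</FONT></FONT></TD>
</TR></TABLE>
"""

soup = BeautifulSoup(html, "html.parser")

label_re = re.compile(r"smaller\s+reporting\s+company", re.I)
wingdings_re = re.compile(r"wingdings", re.I)
# Assumption: characters treated as "checked"; adjust for the files at hand.
CHECKED = {"þ", "ý", "x", "X"}

status = {}
for text_node in soup.find_all(string=label_re):
    # Go up to the enclosing tag, then look for a Wingdings <font> inside it.
    mark = text_node.parent.find("font", style=wingdings_re)
    if mark is None:
        continue  # e.g. the "(Do not check ...)" note has no box next to it
    status[text_node.strip()] = mark.get_text(strip=True) in CHECKED

print(status)
```

The note in parentheses also matches the regex, but it is skipped because its enclosing font tag contains no Wingdings font.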
You may try iterating through the structure and checking for values inside the inner tags or checking for values in the outer tags. I can't remember off hand how to do it and I ended up using lxml for this, but I think bsoup may be able to do this.
If you can't get bsoup to do it check out lxml. It is potentially faster depending upon what you are doing. It also has hooks for using bsoup with lxml.
lxml has a tolerant HTML parser. You don't need bsoup (which is now deprecated by its author) and you should avoid regexes for parsing HTML.
Here is a first rough cut at what you are looking for:
guff = """\
<HTML>
<HEAD><TITLE></TITLE></HEAD>
[snip]
</DIV></BODY></HTML>
"""
from lxml.html import fromstring
doc = fromstring(guff)
for td_el in doc.iter('td'):
    font_els = list(td_el.iter('font'))
    if not font_els: continue
    print
    for el in font_els:
        print (el.text, el.attrib)
This produces:
(' Large accelerated filer ', {'style': 'white-space: nowrap'})
('o', {'style': 'font-family: Wingdings'})
('Accelerated filer ', {'style': 'white-space: nowrap'})
('o', {'style': 'font-family: Wingdings'})
(' Non-accelerated filer ', {'style': 'white-space: nowrap'})
('o', {'style': 'font-family: Wingdings'})
('(Do not check if a smaller reporting company)', {'style': 'white-space: nowrap'})
(' Smaller reporting company ', {'style': 'white-space: nowrap'})
(u'\xfe', {'style': 'font-family: Wingdings'})
