Python BeautifulSoup parsing specific text - python

I am parsing an html file and I want to find the part of the file where it says "Smaller Reporting Company" and either has an "X" or Checkbox next to it or doesn't. The checkbox is typically done with the Wingdings font or an ascii code. In the HTML below you'll see it has an þ in wingdings next to it.
I have no problem showing the results of a regular expression search for the text, but I'm having trouble going the next step and looking for a check box.
I will be using this to parse a number of different html files that won't all follow the same format, but most of them will use a table and ascii text like this example.
Here is the HTML code:
<HTML>
<HEAD><TITLE></TITLE></HEAD>
<BODY>
<DIV align="left">Indicate by check mark whether the registrant is a large accelerated filer, an accelerated filer, a non-accelerated filer, or a smaller reporting company. See the definitions of “large accelerated filer,” “accelerated filer” and “smaller reporting company”. (Check one):
</DIV>
<DIV align="center">
<TABLE style="font-size: 10pt" cellspacing="0" border="0" cellpadding="0" width="100%">
<!-- Begin Table Head -->
<TR valign="bottom">
<TD width="22%"> </TD>
<TD width="3%"> </TD>
<TD width="22%"> </TD>
<TD width="3%"> </TD>
<TD width="22%"> </TD>
<TD width="3%"> </TD>
<TD width="22%"> </TD>
</TR>
<TR></TR>
<!-- End Table Head -->
<!-- Begin Table Body -->
<TR valign="bottom">
<TD align="center" valign="top"><FONT style="white-space: nowrap"> Large accelerated filer <FONT style="font-family: Wingdings">o</FONT></FONT>
</TD>
<TD> </TD>
<TD align="center" valign="top"><FONT style="white-space: nowrap">Accelerated filer <FONT style="font-family: Wingdings">o</FONT></FONT>
</TD>
<TD> </TD>
<TD align="center" valign="top"><FONT style="white-space: nowrap"> Non-accelerated filer <FONT style="font-family: Wingdings">o</FONT> </FONT>
<FONT style="white-space: nowrap">(Do not check if a smaller reporting company)</FONT>
</TD>
<TD> </TD>
<TD align="center" valign="top"><FONT style="white-space: nowrap"> Smaller reporting company <FONT style="font-family: Wingdings">þ</FONT></FONT></TD>
</TR>
<!-- End Table Body -->
</TABLE>
</DIV></BODY></HTML>
Here is my Python code:
import os, sys, string, re
from BeautifulSoup import BeautifulSoup
rawDataFile = "testfile1.html"
f = open(rawDataFile)
soup = BeautifulSoup(f)
f.close()
search = soup.findAll(text=re.compile('[sS]maller.*[rR]eporting.*[cC]ompany'))
print search
Question:
How could I set this up to have a second search that is dependent upon the first search? So when I find "smaller reporting company" I can search the next few lines to see if there is an ascii code? I've been going through the soup docs. I tried to do find and findNext but I haven't been able to get it to work.

If you know the position of the wingding character won't change, you can use .next.
>>> nodes = soup.findAll(text=re.compile('[sS]maller.*[rR]eporting.*[cC]ompany'))
>>> nodes[-1].next.next # last item in list is the only good one... kinda crap
u'þ'
Or you can go up, and then find from there:
>>> nodes[-1].parent.find('font',style="font-family: Wingdings").next
u'þ'
Or you could do it the other way round:
>>> soup.findAll(text='þ')[0].previous.previous
u' Smaller reporting company '
This assumes that you know the Wingdings characters you're looking for.
The last strategy has the added bonus of filtering out other crap that your regex is catching, which I suppose you don't really want; you can then just cycle through the results knowing that you're only working on the right list, so you can peruse it to your liking.
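The "go up, then find from there" idea can be wrapped in a small helper. This is only a sketch using the newer bs4 package rather than the BeautifulSoup 3 import from the question; the helper name is_checked, the CHECKED_GLYPHS set, and the trimmed sample markup are assumptions made for illustration. It matches the label at the start of each cell so that the "(Do not check if a smaller reporting company)" note is not mistaken for the label itself.

```python
import re
from bs4 import BeautifulSoup

# Trimmed-down version of the table from the question (sample data).
html = """
<TABLE><TR>
<TD><FONT style="white-space: nowrap">Accelerated filer <FONT style="font-family: Wingdings">o</FONT></FONT></TD>
<TD><FONT style="white-space: nowrap"> Non-accelerated filer <FONT style="font-family: Wingdings">o</FONT> </FONT>
<FONT style="white-space: nowrap">(Do not check if a smaller reporting company)</FONT></TD>
<TD><FONT style="white-space: nowrap"> Smaller reporting company <FONT style="font-family: Wingdings">þ</FONT></FONT></TD>
</TR></TABLE>
"""

# Assumption: "þ" marks a ticked box and anything else (e.g. "o") an empty one.
CHECKED_GLYPHS = {"\u00fe"}

def is_checked(soup, label_pattern):
    """Return True/False for the first cell that starts with the label, or None if no cell does."""
    label_re = re.compile(label_pattern, re.IGNORECASE)
    for td in soup.find_all("td"):
        # Anchor at the start of the cell text so notes mentioning the label are skipped.
        if not label_re.match(td.get_text().strip()):
            continue
        glyph = td.find("font", style=re.compile("wingdings", re.IGNORECASE))
        if glyph is None:
            continue
        return glyph.get_text(strip=True) in CHECKED_GLYPHS
    return None

soup = BeautifulSoup(html, "html.parser")
smaller = is_checked(soup, r"smaller\s+reporting\s+company")
accelerated = is_checked(soup, r"accelerated\s+filer")
print(smaller, accelerated)
```

Files that mark the box with a different character would need their glyphs added to CHECKED_GLYPHS.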

You may try iterating through the structure and checking for values inside the inner tags or checking for values in the outer tags. I can't remember off hand how to do it and I ended up using lxml for this, but I think bsoup may be able to do this.
If you can't get bsoup to do it check out lxml. It is potentially faster depending upon what you are doing. It also has hooks for using bsoup with lxml.

lxml has a tolerant HTML parser. You don't need bsoup (which is now deprecated by its author) and you should avoid regexes for parsing HTML.
Here is a first rough cut at what you are looking for:
guff = """\
<HTML>
<HEAD><TITLE></TITLE></HEAD>
[snip]
</DIV></BODY></HTML>
"""
from lxml.html import fromstring
doc = fromstring(guff)
for td_el in doc.iter('td'):
    font_els = list(td_el.iter('font'))
    if not font_els: continue
    print
    for el in font_els:
        print (el.text, el.attrib)
This produces:
(' Large accelerated filer ', {'style': 'white-space: nowrap'})
('o', {'style': 'font-family: Wingdings'})
('Accelerated filer ', {'style': 'white-space: nowrap'})
('o', {'style': 'font-family: Wingdings'})
(' Non-accelerated filer ', {'style': 'white-space: nowrap'})
('o', {'style': 'font-family: Wingdings'})
('(Do not check if a smaller reporting company)', {'style': 'white-space: nowrap'})
(' Smaller reporting company ', {'style': 'white-space: nowrap'})
(u'\xfe', {'style': 'font-family: Wingdings'})

Related

How to extract data with beautifulsoup with similar attributes

I'm trying to scrape a saved html page of results and copy the entries for each and iterate through the document. However I can't figure out how to narrow down the element to start. The data I want to grab is in the "td" tags below each of the following "tr" tags:
<tr bgcolor="#d7d7d7">
<td valign="top" nowrap="">
Submittal<br>20190919-5000
<!-- ParentAccession= -->
<br>
</td>
<td valign="top">
09/18/2019<br>
09/19/2019
</td>
<td valign="top" nowrap="">
ER19-2760-000<br>ER19-2762-000<br>ER19-2763-000<br>ER19-2764-000<br>ER19-2765-000<br>ER19-2766-000<br>ER19-2768-000<br><br>
</td>
<td valign="top">
(doc-less) Motion to Intervene of Snohomish County Public Utility District No. 1 under ER19-2760, et. al..<br>Availability: Public<br>
</td>
<td valign="top">
<classtype>Intervention /<br> Motion/Notice of Intervention</classtype>
</td>
<td valign="top">
<table valign="top">
<input type="HIDDEN" name="ext" value="TXT"><tbody><tr><td valign="top"> <input type="checkbox" name="subcheck" value="V:14800341:12904817:15359058:TXT"></td><td> Text</td><td> &nbsp; 0K</td></tr><input type="HIDDEN" name="ext" value="PDF"><tr><td valign="top"> <input type="checkbox" name="subcheck" value="V:14800341:12904822:15359063:PDF"></td><td> FERC Generated PDF</td><td> 11K</td></tr>
</tbody></table>
</td>
The next tag is <tr bgcolor="White">, with the same structure as the one above. These alternate so the results are in different colors on the results page.
I need to go through all of the subsequent td tags and grab the data but they aren't differentiated by a class or anything I can zero in on. The code I wrote grabs the entire contents of the td tags text and appends it but I need to treat each td tag as a separate item and then do the same for the next entry etc.
By setting the td[0] value I start at the first td tag but I don't think this is the correct approach.
from bs4 import BeautifulSoup
import urllib
import re
soup = BeautifulSoup(open("/Users/Desktop/FERC/uploads/ferris_9-19-2019-9-19-2019.electric.submittal.html"), "html.parser")
data = []
for td in soup.findAll(bgcolor=["#d7d7d7", "White"]):
    values = [td[0].text.strip() for td in td.findAll('td')]
    data.append(values)
print(data)
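One possible way to treat each td as a separate item is to loop over the rows first and then over each row's direct td children. The sketch below assumes the alternating rows really are tr tags with those two bgcolor values; the sample markup is a cut-down, partly made-up stand-in for the saved page, not the real file.

```python
from bs4 import BeautifulSoup

# Made-up, shortened stand-in for the saved results page.
html = """<table>
<tr bgcolor="#d7d7d7"><td>Submittal<br>20190919-5000</td><td>09/18/2019<br>09/19/2019</td></tr>
<tr bgcolor="White"><td>Submittal<br>20190919-5001</td><td>09/18/2019<br>09/19/2019</td></tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")

data = []
for row in soup.find_all("tr", bgcolor=["#d7d7d7", "White"]):
    # recursive=False keeps only this row's own cells, not cells of nested tables.
    cells = row.find_all("td", recursive=False)
    # One string per cell; pieces separated by <br> are joined with a space.
    data.append([cell.get_text(" ", strip=True) for cell in cells])

print(data)
```

Each entry in data is then one result row, and each element of that entry is the text of one cell, so the columns can be addressed by index.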

Working with broken HTML + BeautifulSoup

I have some wonderfully broken HTML that, long story short, is preventing me from using the normal nested <table>, <tr>, <td> structure that would make it easy to reconstruct tables.
Here's a snippet with line numbers for reference:
1 <td valign="top"> <!-- closing </td> should be on 6 -->
2 <font face="arial" size="1">
3 <center>
4 06-30-95
5 </center>
6 <tr valign="top">
7 <td>
8 <center>
9 <font ,="" arial,="" face="arial" sans="" serif"="" size="1">
10 1382
11 <p>
12 (23)
13 </p>
14 </font>
15 </center>
16 </td>
17 <td>
18 <font ,="" arial,="" face="arial" sans="" serif"="" size="1">
19 <center>
20 06-18-14
21 </center>
22 </font>
23 </td>
24 </tr>
25 </td> <!-- this should be on 6 -->
The nesting of trs within tds within trs has no scheme to it whatsoever, and is coupled with unclosed tags to boot. The HTML tree in no way resembles how it is structurally rendered. (In this case, I suppose there are technically no missing closing tags, but the actual rendering of the page makes it clear there should be no nested tds.)
However, playing by the following set of rules would work in this case:
For any <td> that is followed by another opening <td> before its closing </td> (i.e. any nested td), assume that the later opening <td> (line 7) serves as closure for the first (line 1);
Otherwise, just grab (open, close) <td> ... </td> tags as usual, where the opener and closer have no <td> in between them; an example would be lines 17 & 23 above.
Desired result here would be something like:
['06-30-95', '1382\n(23)', '06-18-14']
How can this be addressed in BeautifulSoup? I would show an attempt, but have picked through the docs and some of the source and not found much at all.
Currently this would parse to:
html = """
<td valign="top">
<font face="arial" size="1">
<center>
06-30-95
</center>
<tr valign="top">
<td>
<center>
<font ,="" arial,="" face="arial" sans="" serif"="" size="1">
1382
<p>
(23)
</p>
</font>
</center>
</td>
<td>
<font ,="" arial,="" face="arial" sans="" serif"="" size="1">
<center>
06-18-14
</center>
</font>
</td>
</tr>
</td>
"""
from bs4 import BeautifulSoup, SoupStrainer
strainer = SoupStrainer('td')
soup = BeautifulSoup(html, 'html.parser', parse_only=strainer)
[tag.text.replace('\n', '') for tag in soup.find_all('td')]
[' 06-30-95 1382 (23) 06-18-14 ',
' 1382 (23) ',
' 06-18-14 ']
And my issue with that result is not the whitespace; it's the repetition of substrings. It almost seems like I'd need to recursively work upwards from the innermost tags, popping off each and working outwards. But I have to guess there's more built-in functionality for dealing with missing closing tags (handle_endtag stands out from the BeautifulSoup constructor?).
For wonderfully broken HTML, there are two ways you can go about this. The first is to find the most consistent set of opened/closed tags at the innermost possible nesting level, and use only the first one in each cell. In the limited example provided, it looks like the <center> tags will satisfy this. Consider the following:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html, 'html.parser')
>>> [t.find('center').text.strip() for t in soup.find_all('td')]
['06-30-95', '1382\n \n (23)', '06-18-14']
Alternatively, using lxml instead (as the documentation listed that as a method) may actually work better overall:
>>> soup2 = BeautifulSoup(html, 'lxml')
>>> [t.text.strip() for t in soup2.find_all('td')]
['06-30-95', '1382\n \n (23)', '06-18-14']
There are other methods that are covered in this thread: Fast and effective way to parse broken HTML?
Try this. It should give you the output you asked for (here content is the HTML string):
from bs4 import BeautifulSoup
soup = BeautifulSoup(content, 'html5lib')
item = [' '.join(items.text.split()) for items in soup.select("center")]
print(item)
Output:
['06-30-95', '1382 (23)', '06-18-14']
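The rule from the question (a td owns only the text that is not inside a nested td) can also be applied directly to the tree that html.parser builds, without relying on the <center> tags. This is a rough sketch; own_text is a hypothetical helper name, and the markup is a shortened version of the snippet above.

```python
from bs4 import BeautifulSoup

# Shortened version of the broken snippet: an outer td wrapping a tr with two tds.
html = """
<td valign="top">
<font face="arial" size="1">
<center>06-30-95</center>
<tr valign="top">
<td><center><font size="1">1382<p>(23)</p></font></center></td>
<td><font size="1"><center>06-18-14</center></font></td>
</tr>
</td>
"""

def own_text(td):
    """Join the strings whose nearest td ancestor is this td, ignoring nested tds."""
    pieces = [s for s in td.find_all(string=True) if s.find_parent("td") is td]
    # Collapse runs of whitespace and newlines into single spaces.
    return " ".join(" ".join(pieces).split())

soup = BeautifulSoup(html, "html.parser")
cells = [own_text(td) for td in soup.find_all("td")]
print(cells)
```

This depends on html.parser keeping the nesting exactly as written; a parser that repairs the markup, such as lxml or html5lib, produces a different tree, as the answers above show.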

Beautiful Soup: extracting tagged and untagged HTML text

As a novice with bs4 I'm looking for some help in working out how to extract the text from a series of webpage tables, one of which is like this:
<table style="padding:0px; margin:1px" width="715px">
<tr>
<td height="22" width="33%" >
<span class="darkGreenText"><strong> Name: </strong></span>
Tyto alba
</td>
<td height="22" width="33%" >
<span class="darkGreenText"><strong> Order: </strong></span>
Strigiformes
</td>
<td height="22" width="33%">
<span class="darkGreenText"><strong> Family: </strong></span>
Tytonidae
</td>
<td height="22" width="66%" colspan="2">
<span class="darkGreenText"><strong> Status: </strong></span>
Least Concern
</td>
</tr>
</table>
Desired output:
Name: Tyto alba
Order: Strigiformes
Family: Tytonidae
Status: Least Concern
I've tried using [index] as recommended (https://stackoverflow.com/a/35050622/1726290),
and also next_sibling (https://stackoverflow.com/a/23380225/1726290) but I'm getting stuck as one part of the text I need is tagged and the second part is not. Any help would be appreciated.
It seems like what you want is to call get_text(strip=True) on the BeautifulSoup Tag. Assuming raw_html is the HTML you pasted above:
htmlSoup = BeautifulSoup(raw_html)
for tag in htmlSoup.select('td'):
    print(tag.get_text(strip=True))
which prints:
Name:Tyto alba
Order:Strigiformes
Family:Tytonidae
Status:Least Concern
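The printed lines lack the space after the colon that the desired output has. Passing a separator as the first argument of get_text is one way to put it back. A small sketch, with raw_html reduced to two cells of the table from the question:

```python
from bs4 import BeautifulSoup

# Two cells from the table in the question, kept short for the example.
raw_html = """
<table><tr>
<td height="22" width="33%">
<span class="darkGreenText"><strong> Name: </strong></span>
Tyto alba
</td>
<td height="22" width="33%">
<span class="darkGreenText"><strong> Order: </strong></span>
Strigiformes
</td>
</tr></table>
"""

soup = BeautifulSoup(raw_html, "html.parser")

# Each text piece is stripped, then the pieces are joined with a single space.
lines = [td.get_text(" ", strip=True) for td in soup.find_all("td")]
for line in lines:
    print(line)
```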

Pull out information from between tags without unique classes or IDs

I'm trying to scrape the content of a particular website and render the output so that it can be further manipulated / used in other mediums. The biggest challenge I'm facing is that few of the tags have unique IDs or classes, and some of the content is simply displayed in between tags, e.g., <br></br>TEXT<br></br> (see, for example, "Ranking" in the sample HTML below).
Somehow, I've created working code - even if commensurate with the skill of someone in the fourth grade - but this is the furthest I've gotten, and I was hoping to get some help on how to continue to pull out the relevant information. Ultimately, I'm looking to pull any plain text within tags, and plain text in between tags. The only exception is that, whenever there's an img of word_icon_b.gif or word_icon_R.gif, then text "active" or "inactive" gets substituted.
Below is the code I've managed to cobble together:
from bs4 import BeautifulSoup
import urllib2
pageFile = urllib2.urlopen("http://www.url.com")
pageHtml = pageFile.read()
pageFile.close()
soup = BeautifulSoup("".join(pageHtml))
table = soup.findAll("td",{"class": ["row_col_1", "row_col_2"]})
print table
There are several tables in the page, but none have a unique ID or class, so I didn't reference them, and instead just pulled out the TDs - since these are the only tags with unique classes which sandwich the information I'm looking for. The HTML that the above pulls from is as follows:
<tr>
<td class="row_col_1" valign="top" nowrap="" scope="row">
5/22/2014
</td>
<td class="row_col_1" valign="top" scope="row">
100%
</td>
<td class="row_col_1" valign="top" nowrap="" scope="row">
<a target="_top" href="/NY/2014/N252726.DOC">
<img width="18" height="14" border="0" alt="Click here to download word document n252726.doc" src="images/word_icon_b.gif"></img>
</a>
<a target="_top" href="index.asp?ru=n252726&qu=ranking&vw=detail">
NY N252726
</a>
<br></br>
Ranking
<br></br>
<a href="javascript:disclaimer('EU Regulatory Body','h…cripts/listing_current.asp?Phase=List_items&lookfor=847720');">
8477.20.mnop
</a>
<br></br>
<a target="_new" href="http://dataweb.url.com/scripts/ranking_current.asp?Phase=List_items&lookfor=847759">
8477.59.abcd
</a>
<br></br>
</td>
<td class="row_col_1" valign="top" scope="row">
The ranking of a long-fanged monkey sock puppet that is coding-ly challenged
</td>
<td class="row_col_1" valign="top" nowrap="" scope="row">
</td>
</tr>
The reason why I have ["row_col_1", "row_col_2"] is because the data served up is presented as <td class="row_col_1" valign="top" nowrap="" scope="row"> for the odd rows, and <td class="row_col_2" valign="top" nowrap="" scope="row"> for the even rows. I have no control over the HTML that I'm attempting to pull from.
Also, the base links, such as javascript:disclaimer('EU Regulatory Body','h…cripts/listing_current.asp? and http://dataweb.url.com/scripts/ranking_current.asp?Phase=List_items& will always remain the same (though the specific links will change, e.g., current.asp?Phase=List_items&lookfor=847759" may next be on the next page as current.asp?Phase=List_items&lookfor=101010">).
EDIT: @Martijn: I'm hoping to have returned to me the following items from the HTML: 1) 5/22/2014, 2) 100%, 3) the image name, word_icon_b.gif (to substitute text for it), 4) NY N252726 (and the preceding link), 5) Ranking, 6) 8477.20.mnop (and the preceding link), 7) 8477.59.abcd (and the preceding link), and 8) 'The ranking of a long-fanged monkey sock puppet that is coding-ly challenged.'
I'd like the output to be wrapped in XML tags, but this is not excessively important; I imagine these tags can just be inserted into the bs4 code.
If you would like to try the lxml library and XPath, this is a hint of what your code might look like. You should probably make a better selection of the desired <tr>s than I did without seeing the HTML code. You should also handle any potential IndexErrors, etc.
import urllib2
from lxml import html

pageFile = urllib2.urlopen("http://www.url.com")
pageHtml = pageFile.read()
pageFile.close()
x = html.fromstring(pageHtml)
all_rows = x.xpath(".//tr")
results = []
for row in all_rows:
    # Each [0] raises IndexError when a row has no match; guard as needed.
    date = row.xpath(".//td[contains(@class, 'row_col')]/text()")[0]
    location = row.xpath(".//a[contains(@href, 'index.asp')]/text()")[0]
    rank = row.xpath(".//a[contains(@href, 'javascript:disclaimer(')]/text()")[0]
    results.append({'date': date, 'location': location, 'rank': rank})
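If you would rather stay with bs4, walking a cell's descendants gives you both the text inside tags and the loose text between the <br> tags, and lets you swap the icon for a word as you go. This is only a sketch: the helper name cell_items and the mapping of word_icon_b.gif to "active" and word_icon_R.gif to "inactive" are assumptions, since the question does not say which icon means which, and the markup is a shortened version of the sample cell.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Assumed mapping; the question does not state which icon means which.
ICON_TEXT = {"word_icon_b.gif": "active", "word_icon_R.gif": "inactive"}

# Shortened version of the third cell from the sample row.
html = """
<td class="row_col_1" valign="top">
<a target="_top" href="/NY/2014/N252726.DOC"><img alt="download" src="images/word_icon_b.gif"></a>
<a target="_top" href="index.asp?ru=n252726">NY N252726</a>
<br>Ranking<br>
<a href="#">8477.20.mnop</a>
</td>
"""

def cell_items(td):
    """Collect non-empty text pieces in document order, replacing known icons with a word."""
    items = []
    for node in td.descendants:
        if isinstance(node, NavigableString):
            text = node.strip()
            if text:
                items.append(text)
        elif isinstance(node, Tag) and node.name == "img":
            file_name = node.get("src", "").rsplit("/", 1)[-1]
            # Unknown images fall back to their file name.
            items.append(ICON_TEXT.get(file_name, file_name))
    return items

soup = BeautifulSoup(html, "html.parser")
cell = soup.find("td", class_=["row_col_1", "row_col_2"])
items = cell_items(cell)
print(items)
```

Wrapping each item in an XML tag could then be done when the list is written out.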

Regex returning nothing in Python

I'm working in Python for the first time and I've used Mechanize to search a website along with BeautifulSoup to select a particular div, now I'm trying to grab a specific sentence with a regular expression. This is the soup object's contents;
<div id="results">
<table cellspacing="0" width="100%">
<tr>
<th align="left" valign="middle" width="32%">Physician Name, (CPSO#)</th>
<th align="left" valign="middle" width="36%">Primary Practice Location</th>
<!-- <th width="16%" align="center" valign="middle">Accepting New Patients?</th> -->
<th align="center" valign="middle" width="32%">Disciplinary Info & Restrictions</th>
</tr>
<tr>
<td>
<a class="doctor" href="details.aspx?view=1&id= 85956">Hull, Christopher Merritt </a> (#85956)
</td>
<td>Four Counties Medical Clinic<br/>1824 Concessions Dr<br/>Newbury ON N0L 1Z0<br/>Phone: (519) 693-0350<br/>Fax: (519) 693-0083</td>
<!-- <td></td> -->
<td align="center"></td>
</tr>
</table>
</div>
(Thank you for the assistance with formatting)
My regular expression to get the text "Hull, Christopher Merritt" is;
patFinderName = re.compile('<a class="doctor" href="details.aspx?view=1&id= 85956">(.*) </a>')
It keeps returning empty and I can't figure out why, anybody have any ideas?
Thank you for the answers, I've changed it to;
patFinderName = re.compile('<a class="doctor" href=".*">(.*) </a>')
Now it works beautifully.
? is a magic token in regular expressions, meaning zero or one of the previous atom. As you want a literal question mark symbol, you need to escape it.
You should escape the ? in your regex:
In [8]: re.findall('<a class="doctor" href="details.aspx\?view=1&id= 85956">(.*)</a>', text)
Out[8]: ['Hull, Christopher Merritt ']
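When the fixed part of a pattern is meant to be taken literally, re.escape can do the escaping for you. The short sketch below shows the original pattern failing and an escaped one matching; the variable names are made up for the example, and text holds just the relevant line from the page.

```python
import re

text = '<a class="doctor" href="details.aspx?view=1&id= 85956">Hull, Christopher Merritt </a> (#85956)'

# Unescaped: "x?" means "an optional x", so the literal "?" in the text never matches.
unescaped = re.findall('<a class="doctor" href="details.aspx?view=1&id= 85956">(.*)</a>', text)

# re.escape turns every special character in the fixed prefix into a literal.
literal_prefix = re.escape('<a class="doctor" href="details.aspx?view=1&id= 85956">')
escaped = re.findall(literal_prefix + r'(.*?)\s*</a>', text)

print(unescaped)
print(escaped)
```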
