Search in HTML page using Regex patterns with python


I'm trying to find a string inside an HTML page using known patterns.
For example, in the following HTML code:
<TABLE WIDTH="100%">
<TR><TD ALIGN="LEFT" width="50%"> </TD>
<TD ALIGN=RIGHT VALIGN=BOTTOM WIDTH=50%><FONT SIZE=-1>( <STRONG>1</STRONG></FONT> <FONT SIZE=-2>of</FONT> <STRONG><FONT SIZE=-1>1</STRONG> )</FONT></TD></TR></TABLE>
<HR>
<TABLE WIDTH="100%">
<TR> <TD ALIGN="LEFT" WIDTH="50%"><B>String 1</B></TD>
<TD ALIGN="RIGHT" WIDTH="50%"><B><A Name=h1 HREF=#h0></A><A HREF=#h2></A><B><I></I></B>String</B></TD>
</TR>
<TR><TD ALIGN="LEFT" WIDTH="50%"><b>String 2.</B>
</TD>
<TD ALIGN="RIGHT" WIDTH="50%"> <B>
String 3
</B></TD>
</TR>
</TABLE>
<HR>
<font size="+1">String 4</font><BR>
...
I want to find String 4, and I know that it will always be between
<HR><font size="+1">
and </font><BR>
How can I search for the string using a regular expression?
Edit:
I've tried the following, but without success:
p = re.match('<HR><font size="+1">(.*?)</font><BR>',html)
Thanks.

re.findall(r'<HR>\s*<font size="\+1">(.*?)</font><BR>', html, re.DOTALL)
findall returns a list of everything captured by the parentheses (the capture group) in the regular expression. I used re.DOTALL so that the dot also matches newlines.
I used \s* because I was not sure whether there would be any whitespace between the tags.
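A minimal sketch of why the attempt in the question fails and the findall version works, assuming a trimmed-down copy of the page stored in a variable named html (the variable names failed, pattern and found are only illustrative):

```python
import re

# Trimmed-down copy of the page from the question.
html = """</TABLE>
<HR>
<font size="+1">String 4</font><BR>
..."""

# The original attempt: re.match only tries to match at the very start of
# the string, and the unescaped "+" is a quantifier ("one or more of the
# preceding quote character") rather than a literal plus sign.
failed = re.match('<HR><font size="+1">(.*?)</font><BR>', html)
print(failed)

# Escape the plus sign, allow optional whitespace after <HR>, and scan the
# whole string with findall instead of anchoring at the start.
pattern = re.compile(r'<HR>\s*<font size="\+1">(.*?)</font><BR>', re.DOTALL)
found = pattern.findall(html)
print(found)
```

re.search would also work if only the first occurrence is needed; findall is convenient when the pattern may appear several times.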

This works, but may not be very robust:
import re
r = re.compile(r'<HR>\s?<font size="\+1">(.+?)</font>\s?<BR>', re.IGNORECASE)
r.findall(html)
You will be better off using a proper HTML parser. BeautifulSoup is excellent and easy to use. Look it up.
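As a rough illustration of the parser-based approach, here is a sketch that uses the standard library's html.parser instead of BeautifulSoup, so that it runs without installing anything. The class name FontAfterHr and the shortened sample page are made up for this example; it assumes the wanted text sits in a font element with size="+1" that directly follows an HR tag, as described in the question.

```python
from html.parser import HTMLParser


class FontAfterHr(HTMLParser):
    """Collect the text of each <font size="+1"> that directly follows an <hr>."""

    def __init__(self):
        super().__init__()
        self.after_hr = False   # True while the most recent tag seen was <hr>
        self.capturing = False  # True while inside the wanted <font>
        self.results = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names.
        if tag == "hr":
            self.after_hr = True
            return
        if tag == "font" and self.after_hr and dict(attrs).get("size") == "+1":
            self.capturing = True
            self.results.append("")
        self.after_hr = False

    def handle_endtag(self, tag):
        if tag == "font":
            self.capturing = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current result.
        if self.capturing:
            self.results[-1] += data


# Shortened version of the page from the question.
html = """<HR>
<TABLE WIDTH="100%">
<TR><TD ALIGN="LEFT" WIDTH="50%"><B>String 1</B></TD></TR>
</TABLE>
<HR>
<font size="+1">String 4</font><BR>
"""

parser = FontAfterHr()
parser.feed(html)
parser.close()
print(parser.results)
```

Unlike the regular expression, this does not care about tag case, attribute order, or the amount of whitespace between the tags.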

re.findall(r'<HR>\n<font size="\+1">([^<]*)<\/font><BR>', html, re.MULTILINE)

Related

Beautiful Soup: extracting tagged and untagged HTML text

As a novice with bs4 I'm looking for some help in working out how to extract the text from a series of webpage tables, one of which is like this:
<table style="padding:0px; margin:1px" width="715px">
<tr>
<td height="22" width="33%" >
<span class="darkGreenText"><strong> Name: </strong></span>
Tyto alba
</td>
<td height="22" width="33%" >
<span class="darkGreenText"><strong> Order: </strong></span>
Strigiformes
</td>
<td height="22" width="33%">
<span class="darkGreenText"><strong> Family: </strong></span>
Tytonidae
</td>
<td height="22" width="66%" colspan="2">
<span class="darkGreenText"><strong> Status: </strong></span>
Least Concern
</td>
</tr>
</table>
Desired output:
Name: Tyto alba
Order: Strigiformes
Family: Tytonidae
Status: Least Concern
I've tried using [index] as recommended (https://stackoverflow.com/a/35050622/1726290),
and also next_sibling (https://stackoverflow.com/a/23380225/1726290) but I'm getting stuck as one part of the text I need is tagged and the second part is not. Any help would be appreciated.

It seems like what you want is to call get_text(strip=True) on each BeautifulSoup Tag. Assuming raw_html is the HTML you pasted above:
from bs4 import BeautifulSoup
htmlSoup = BeautifulSoup(raw_html, "html.parser")
for tag in htmlSoup.select('td'):
    print(tag.get_text(strip=True))
which prints:
Name:Tyto alba
Order:Strigiformes
Family:Tytonidae
Status:Least Concern
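Note that this output has no space after the colon, unlike the desired output: get_text(strip=True) strips each text fragment and joins them with an empty separator. One way around this is to collect the raw text of each cell and collapse runs of whitespace into single spaces. The sketch below shows that idea with the standard library's html.parser rather than bs4, so that it is self-contained; the class name CellText and the two-cell sample are made up for illustration. With bs4 the same normalization could be applied to the string returned by get_text().

```python
from html.parser import HTMLParser


class CellText(HTMLParser):
    """Gather the text fragments found inside each <td> element."""

    def __init__(self):
        super().__init__()
        self.cells = []       # one list of text fragments per <td>
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True
            self.cells.append([])

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.cells[-1].append(data)


# Two cells copied from the table in the question.
raw_html = """<table>
<tr>
<td height="22" width="33%" >
<span class="darkGreenText"><strong> Name: </strong></span>
Tyto alba
</td>
<td height="22" width="33%" >
<span class="darkGreenText"><strong> Order: </strong></span>
Strigiformes
</td>
</tr>
</table>"""

parser = CellText()
parser.feed(raw_html)
parser.close()

# Join the fragments of each cell, then collapse all whitespace runs.
lines = [" ".join("".join(parts).split()) for parts in parser.cells]
for line in lines:
    print(line)
```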

Extract attribute's value with XPath in Python

I have the HTML:
<table>
<tbody>
<tr>
<td align="left" valign="top" style="padding: 0 10px 0 60px;">
<img src="/files/39.jpg" width="64" height="64">
</td>
<td align="left" valign="middle"><h1>30 Rock</h1></td>
</tr>
</tbody>
</table>
Using Python and LXML I need to extract the value from the attribute src of the <img> element. Here's what I've tried:
import lxml.html
import urllib
# make HTTP request to site
page = urllib.urlopen("http://my.url.com")
# read the downloaded page
doc = lxml.html.document_fromstring(page.read())
txt1 = doc.xpath('/html/body/table[2]/tbody/tr/td[1]/img')
When I print txt1 I get the empty list only []. How can I correct this?

Use this XPath:
//img/@src
It selects the src attribute of every img element in the document. With lxml, doc.xpath('//img/@src') returns a list of the attribute values as strings.

How to create a regex for the following scenario (HTML)?

I have a few known formats in an HTML page, and I need to parse the content of the tags:
<TR>
<TD align=center>Reissue of:</TD>
<TD align=center> **VALUES_TO_FIND** </TD>
<TD> </TD>
</TR>
<TR>
<TD align=center> </TD>
</TR>
Basically, I thought I could combine the HTML with a regular expression that matches anything in the spot I'm looking for.
I know that the text before and after VALUES_TO_FIND will always be the same. How can I find it using a regular expression? (I'm dealing with several cases, and the format can repeat in several places on the page.)

This is what you are looking for:
import re
s="""
<TR>
<TD align=center>Reissue of:</TD>
<TD align=center> **VALUES_TO_FIND** </TD>
<TD> </TD>
</TR>
"""
p="""
<TR>
<TD align=center>Reissue of:</TD>
<TD align=center>(.*)</TD>
<TD> </TD>
</TR>
"""
m=re.search(p, s)
print(m.group(1))

Don't use regular expressions to parse HTML (it's not a regular language).
There are many threads on the topic on Stack Overflow.
I recommend using BeautifulSoup, Pattern, or similar modules.

There are many better options for getting data out of HTML than regular expressions. Try Scrapy, for example.

HTML isn't a regular language, so using regular expressions to work with it is difficult.
BeautifulSoup is a nice parser; here's an example of how to use it:
from BeautifulSoup import BeautifulSoup
html = u'''
<TR>
<TD align=center>Reissue of:</TD>
<TD align=center> **VALUES_TO_FIND** </TD>
<TD> </TD>
</TR>
<TR>
<TD align=center> </TD>
</TR>'''
bs = BeautifulSoup(html)
print [td.contents for td in bs.findAll('td')]
output:
[[u'Reissue of:'], [u' **VALUES_TO_FIND** '], [u' '], [u' ']]
You know what to do from here. :)
Install with pip install BeautifulSoup. Here are the docs:
http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html

This regular expression will do:
re.findall(r'<TR>\s+<TD.+?</TD>\s+<TD align=center>(.*?)</TD>',html,re.DOTALL)
But I recommend using a parser.
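Since the question says the format can repeat in several places, here is a sketch of a variation that anchors on the fixed "Reissue of:" label and collects every occurrence. The sample values VALUE_A and VALUE_B are placeholders invented for this example, not data from the question.

```python
import re

# Two repetitions of the known format, with placeholder values.
html = """
<TR>
<TD align=center>Reissue of:</TD>
<TD align=center> VALUE_A </TD>
<TD> </TD>
</TR>
<TR>
<TD align=center> </TD>
</TR>
<TR>
<TD align=center>Reissue of:</TD>
<TD align=center> VALUE_B </TD>
<TD> </TD>
</TR>
"""

# Anchor on the fixed label cell, skip the whitespace between the cells,
# and capture lazily up to the next closing </TD>.
pattern = re.compile(
    r'<TD align=center>Reissue of:</TD>\s*<TD align=center>(.*?)</TD>',
    re.DOTALL,
)

# Strip the padding spaces around each captured value.
values = [value.strip() for value in pattern.findall(html)]
print(values)
```

Anchoring on the label rather than on the row structure keeps the pattern from picking up unrelated cells that happen to use the same attributes.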

Regex returning nothing in Python

I'm working in Python for the first time. I've used Mechanize to search a website along with BeautifulSoup to select a particular div, and now I'm trying to grab a specific sentence with a regular expression. These are the soup object's contents:
<div id="results">
<table cellspacing="0" width="100%">
<tr>
<th align="left" valign="middle" width="32%">Physician Name, (CPSO#)</th>
<th align="left" valign="middle" width="36%">Primary Practice Location</th>
<!-- <th width="16%" align="center" valign="middle">Accepting New Patients?</th> -->
<th align="center" valign="middle" width="32%">Disciplinary Info & Restrictions</th>
</tr>
<tr>
<td>
<a class="doctor" href="details.aspx?view=1&id= 85956">Hull, Christopher Merritt </a> (#85956)
</td>
<td>Four Counties Medical Clinic<br/>1824 Concessions Dr<br/>Newbury ON N0L 1Z0<br/>Phone: (519) 693-0350<br/>Fax: (519) 693-0083</td>
<!-- <td></td> -->
<td align="center"></td>
</tr>
</table>
</div>
(Thank you for the assistance with formatting)
My regular expression to get the text "Hull, Christopher Merritt" is:
patFinderName = re.compile('<a class="doctor" href="details.aspx?view=1&id= 85956">(.*) </a>')
It keeps returning empty and I can't figure out why. Does anybody have any ideas?
Thank you for the answers. I've changed it to:
patFinderName = re.compile('<a class="doctor" href=".*">(.*) </a>')
Now it works beautifully.

? is a special character (a quantifier) in regular expressions, meaning zero or one of the preceding atom. Since you want a literal question mark, you need to escape it as \?.

You should escape the ? in your regex:
In [8]: re.findall('<a class="doctor" href="details.aspx\?view=1&id= 85956">(.*)</a>', text)
Out[8]: ['Hull, Christopher Merritt ']
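To avoid escaping by hand, re.escape can build the literal part of the pattern. The sketch below compares the unescaped pattern, an escaped one, and a looser one that accepts any href. The variable names are illustrative, and text holds only the relevant line from the soup contents above.

```python
import re

text = ('<a class="doctor" href="details.aspx?view=1&id= 85956">'
        'Hull, Christopher Merritt </a> (#85956)')

# Unescaped: "x?" means "an optional x", so the literal "?" never matches.
unescaped = re.findall(
    '<a class="doctor" href="details.aspx?view=1&id= 85956">(.*) </a>', text)
print(unescaped)

# re.escape backslash-escapes the regex metacharacters in a literal string.
literal = re.escape('<a class="doctor" href="details.aspx?view=1&id= 85956">')
escaped = re.findall(literal + r'(.*?)\s*</a>', text)
print(escaped)

# Looser variant: accept any href, but stop at its closing quote so the
# match cannot run past the end of the attribute.
any_href = re.findall(r'<a class="doctor" href="[^"]*">(.*?)\s*</a>', text)
print(any_href)
```

The [^"]* form is a bit safer than href=".*", which is greedy and can swallow several tags when more than one link appears on the same line.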

Python BeautifulSoup parsing specific text

I am parsing an HTML file and I want to find the part of the file where it says "Smaller Reporting Company" and determine whether or not it has an "X" or a checkbox next to it. The checkbox is typically rendered with the Wingdings font or an ASCII code. In the HTML below you'll see it has a þ in Wingdings next to it.
I have no problem showing the results of a regular expression search for the text, but I'm having trouble taking the next step and looking for a checkbox.
I will be using this to parse a number of different HTML files that won't all follow the same format, but most of them will use a table and ASCII text like this example.
Here is the HTML code:
<HTML>
<HEAD><TITLE></TITLE></HEAD>
<BODY>
<DIV align="left">Indicate by check mark whether the registrant is a large accelerated filer, an accelerated filer, a non-accelerated filer, or a smaller reporting company. See the definitions of large accelerated filer, accelerated filer and smaller reporting company. (Check one):
</DIV>
<DIV align="center">
<TABLE style="font-size: 10pt" cellspacing="0" border="0" cellpadding="0" width="100%">
<!-- Begin Table Head -->
<TR valign="bottom">
<TD width="22%"> </TD>
<TD width="3%"> </TD>
<TD width="22%"> </TD>
<TD width="3%"> </TD>
<TD width="22%"> </TD>
<TD width="3%"> </TD>
<TD width="22%"> </TD>
</TR>
<TR></TR>
<!-- End Table Head -->
<!-- Begin Table Body -->
<TR valign="bottom">
<TD align="center" valign="top"><FONT style="white-space: nowrap"> Large accelerated filer <FONT style="font-family: Wingdings">o</FONT></FONT>
</TD>
<TD> </TD>
<TD align="center" valign="top"><FONT style="white-space: nowrap">Accelerated filer <FONT style="font-family: Wingdings">o</FONT></FONT>
</TD>
<TD> </TD>
<TD align="center" valign="top"><FONT style="white-space: nowrap"> Non-accelerated filer <FONT style="font-family: Wingdings">o</FONT> </FONT>
<FONT style="white-space: nowrap">(Do not check if a smaller reporting company)</FONT>
</TD>
<TD> </TD>
<TD align="center" valign="top"><FONT style="white-space: nowrap"> Smaller reporting company <FONT style="font-family: Wingdings">þ</FONT></FONT></TD>
</TR>
<!-- End Table Body -->
</TABLE>
</DIV></BODY></HTML>
Here is my Python code:
import os, sys, string, re
from BeautifulSoup import BeautifulSoup
rawDataFile = "testfile1.html"
f = open(rawDataFile)
soup = BeautifulSoup(f)
f.close()
search = soup.findAll(text=re.compile('[sS]maller.*[rR]eporting.*[cC]ompany'))
print search
Question:
How could I set this up to have a second search that is dependent upon the first search? So when I find "smaller reporting company" I can search the next few lines to see if there is an ascii code? I've been going through the soup docs. I tried to do find and findNext but I haven't been able to get it to work.

If you know the position of the wingding character won't change, you can use .next.
>>> nodes = soup.findAll(text=re.compile('[sS]maller.*[rR]eporting.*[cC]ompany'))
>>> nodes[-1].next.next # last item in list is the only good one... kinda crap
u'þ'
Or you can go up, and then find from there:
>>> nodes[-1].parent.find('font',style="font-family: Wingdings").next
u'þ'
Or you could do it the other way round:
>>> soup.findAll(text='þ')[0].previous.previous
u' Smaller reporting company '
This assumes that you know the Wingdings characters you're looking for.
The last strategy has the added bonus of filtering out other crap that your regex is catching, which I suppose you don't really want; you can then just cycle through the results knowing that you're only working on the right list, so you can peruse it to your liking.

You may try iterating through the structure and checking for values inside the inner tags or checking for values in the outer tags. I can't remember off hand how to do it and I ended up using lxml for this, but I think bsoup may be able to do this.
If you can't get bsoup to do it check out lxml. It is potentially faster depending upon what you are doing. It also has hooks for using bsoup with lxml.

lxml has a tolerant HTML parser. You don't need bsoup (which is now deprecated by its author) and you should avoid regexes for parsing HTML.
Here is a first rough cut at what you are looking for:
guff = """\
<HTML>
<HEAD><TITLE></TITLE></HEAD>
[snip]
</DIV></BODY></HTML>
"""
from lxml.html import fromstring
doc = fromstring(guff)
for td_el in doc.iter('td'):
    font_els = list(td_el.iter('font'))
    if not font_els: continue
    print
    for el in font_els:
        print (el.text, el.attrib)
This produces:
(' Large accelerated filer ', {'style': 'white-space: nowrap'})
('o', {'style': 'font-family: Wingdings'})
('Accelerated filer ', {'style': 'white-space: nowrap'})
('o', {'style': 'font-family: Wingdings'})
(' Non-accelerated filer ', {'style': 'white-space: nowrap'})
('o', {'style': 'font-family: Wingdings'})
('(Do not check if a smaller reporting company)', {'style': 'white-space: nowrap'})
(' Smaller reporting company ', {'style': 'white-space: nowrap'})
(u'\xfe', {'style': 'font-family: Wingdings'})
