I'm working in Python for the first time and I've used Mechanize to search a website along with BeautifulSoup to select a particular div, now I'm trying to grab a specific sentence with a regular expression. This is the soup object's contents;
<div id="results">
<table cellspacing="0" width="100%">
<tr>
<th align="left" valign="middle" width="32%">Physician Name, (CPSO#)</th>
<th align="left" valign="middle" width="36%">Primary Practice Location</th>
<!-- <th width="16%" align="center" valign="middle">Accepting New Patients?</th> -->
<th align="center" valign="middle" width="32%">Disciplinary Info & Restrictions</th>
</tr>
<tr>
<td>
<a class="doctor" href="details.aspx?view=1&id= 85956">Hull, Christopher Merritt </a> (#85956)
</td>
<td>Four Counties Medical Clinic<br/>1824 Concessions Dr<br/>Newbury ON N0L 1Z0<br/>Phone: (519) 693-0350<br/>Fax: (519) 693-0083</td>
<!-- <td></td> -->
<td align="center"></td>
</tr>
</table>
</div>
(Thank you for the assistance with formatting)
My regular expression to get the text "Hull, Christopher Merritt" is;
patFinderName = re.compile('<a class="doctor" href="details.aspx?view=1&id= 85956">(.*) </a>')
It keeps returning empty and I can't figure out why, anybody have any ideas?
Thank you for the answers, I've changed it to;
patFinderName = re.compile('<a class="doctor" href=".*">(.*) </a>')
Now it works beautifully.
? is a magic token in regular expressions, meaning zero or one of the previous atom. As you want a literal question mark symbol, you need to escape it.
You should escape the ? in your regex:
In [8]: re.findall('<a class="doctor" href="details.aspx\?view=1&id= 85956">(.*)</a>', text)
Out[8]: ['Hull, Christopher Merritt ']
Related
I am trying to scrape some data off a website. The data that I want is listed in a table, but there are multiple tables and no ID's. I then had the idea that I would find the header just above the table I was searching for and then use that as an indicator.
This has really troubled me, so as a last resort, I wanted to ask if there were someone who knows how to BeautifulSoup to find the table.
A snipped of the HTML code is provided beneath, thanks in advance :)
The table I am interested in, is the table right beneath <h2>Mine neaste vagter</h2>
<h2>Min aktuelle vagt</h2>
<div>
<a href='/shifts/detail/595212/'>Flere detaljer</a>
<p>Vagt starter: <b>11/06 2021 - 07:00</b></p>
<p>Vagt slutter: <b>11/06 2021 - 11:00</b></p>
<h2>Masker</h2>
<table class='list'>
<tr><th>Type</th><th>Fra</th><th> </th><th>Til</th></tr>
<tr>
<td>Fri egen regningD</td>
<td>07:00</td>
<td> - </td>
<td>11:00</td>
</tr>
</table>
</div>
<hr>
<h2>Mine neaste vagter</h2>
<table class='list'>
<tr>
<th class="alignleft">Dato</th>
<th class="alignleft">Rolle</th>
<th class="alignleft">Tidsrum</th>
<th></th>
<th class="alignleft">Bytte</th>
<th class="alignleft" colspan='2'></th>
</tr>
<tr class="rowA separator">
<td>
<h3>12/6</h3>
</td>
<td>Kundeservice</td>
<td>18:00 → 21:30 (3.5 t)</td>
<td style="max-width: 20em;"></td>
<td>
<a href="/shifts/ajax/popup/595390/" class="swap shiftpop">
Byt denne vagt
</a>
</td>
<td><a href="/shifts/detail/595390/">Detaljer</td>
<td>
</td>
</tr>
Here are two approaches to find the correct <table>:
Since the table you want is the last one in the HTML, you can use find_all() and using index slicing [-1] to find the last table:
print(soup.find_all("table", class_="list")[-1])
Find the h2 element by text, and the use the find_next() method to find the table:
print(soup.find(lambda tag: tag.name == "h2" and "Mine neaste vagter" in tag.text).find_next("table"))
You can use :-soup-contains (or just :contains) to target the <h2> by its text and then use find_next to move to the table:
from bs4 import BeautifulSoup as bs
html = '''your html'''
soup = bs(html, 'lxml')
soup.select_one('h2:-soup-contains("Mine neaste vagter")').find_next('table')
This is assuming the HTML, as shown, is returned by whatever access method you are using.
I am learning Beautiful Soup and Python and in this context I am doing the "Baby names" exercise of the Google Tutorial on Regex using the set of html files that contains popular baby names for different years (e.g. baby1990.html etc). You can find this dataset if you are interested here: https://developers.google.com/edu/python/exercises/baby-names
The html files contain a particular table which store the popular baby names and whose html code is the following:
<table width="100%" border="0" cellspacing="0" cellpadding="4" summary="formatting">
<tr valign="top"><td width="25%" class="greycell">
Background information
<p><br />
Select another <label for="yob">year of birth</label>?<br />
<form method="post" action="/cgi-bin/popularnames.cgi">
<input type="text" name="year" id="yob" size="4" value="1990">
<input type="hidden" name="top" value="1000">
<input type="hidden" name="number" value="">
<input type="submit" value=" Go "></form>
</td><td>
<h3 align="center">Popularity in 1990</h3>
<p align="center">
<table width="48%" border="1" bordercolor="#aaabbb"
cellpadding="2" cellspacing="0" summary="Popularity for top 1000">
<tr align="center" valign="bottom">
<th scope="col" width="12%" bgcolor="#efefef">Rank</th>
<th scope="col" width="41%" bgcolor="#99ccff">Male name</th>
<th scope="col" bgcolor="pink" width="41%">Female name</th></tr>
<tr align="right"><td>1</td><td>Michael</td><td>Jessica</td> # Targeted row
<tr align="right"><td>2</td><td>Christopher</td><td>Ashley</td> # Targeted row
etc...
There is also another table in the html file that I do not want to capture and has the following html code.
<table width="100%" border="0" cellspacing="0" cellpadding="4">
<tbody>
<tr><td class="sstop" valign="bottom" align="left" width="25%">
Social Security Online
</td><td valign="bottom" class="titletext">
<!-- sitetitle -->Popular Baby Names
</td>
</tr>
<tr bgcolor="#333366"><td colspan="2" height="2"></td></tr>
<tr><td class="graystars" width="25%" valign="top">
Popular Baby Names</td><td valign="top">
<a href="http://www.ssa.gov/"><img src="/templateimages/tinylogo.gif"
width="52" height="47" align="left"
alt="SSA logo: link to Social Security home page" border="0"></a><a name="content"></a>
<h1>Popular Names by Birth Year</h1>September 12, 2007</td>
</tr>
<tr bgcolor="#333366"><td colspan="2" height="1"></td></tr>
</tbody></table>
In comparing the table Tags of the two tables I concluded that the unique characteristic of the targeted table -- the table I am trying to capture-- is the 'summary' attribute which appears to have the value 'formatting'. Therefore I tried the following command:
right_table = soup.find("table", summary = "formatting")
However, this command failed to select the targeted table.
In contrast, the following command succeeded:
table = soup.find(summary="Popularity for top 1000")
Could you explain by looking at the html code why the first command failed and the second succeeded?
Your advice will be appreciated.
I answered your question earlier, the code works.
And one more thing, html.patser is broken in python2, do not use it, use lxml.
I'm trying to find a string inside a HTML page with known patterns.
for example, in the following HTML code:
<TABLE WIDTH="100%">
<TR><TD ALIGN="LEFT" width="50%"> </TD>
<TD ALIGN=RIGHT VALIGN=BOTTOM WIDTH=50%><FONT SIZE=-1>( <STRONG>1</STRONG></FONT> <FONT SIZE=-2>of</FONT> <STRONG><FONT SIZE=-1>1</STRONG> )</FONT></TD></TR></TABLE>
<HR>
<TABLE WIDTH="100%">
<TR> <TD ALIGN="LEFT" WIDTH="50%"><B>String 1</B></TD>
<TD ALIGN="RIGHT" WIDTH="50%"><B><A Name=h1 HREF=#h0></A><A HREF=#h2></A><B><I></I></B>String</B></TD>
</TR>
<TR><TD ALIGN="LEFT" WIDTH="50%"><b>String 2.</B>
</TD>
<TD ALIGN="RIGHT" WIDTH="50%"> <B>
String 3
</B></TD>
</TR>
</TABLE>
<HR>
<font size="+1">String 4</font><BR>
...
I want to find String 4 , and I know that it will always be between
<HR><font size="+1">
and </font><BR>
how can I search for the string using RE?
edit:
I've tried the following, but no success:
p = re.match('<HR><font size="+1">(.*?)</font><BR>',html)
thanks.
re.findall(r'<HR>\s*<font size="\+1">(.*?)</font><BR>', html, re.DOTALL)
findall is returning a list with everything that is captured between the brackets in the regular expression. I used re.DOTALL so the dot also captures end of lines.
I used \s* because I was not sure whether there would be any whitespace.
This works, but may not be very robust:
import re
r = re.compile('<HR>\s?<font size="\+1">(.+?)</font>\s?<BR>', re.IGNORECASE)
r.findall(html)
You will be better off using a proper HTML parser. BeautifulSoup is excellent and easy to use. Look it up.
re.findall(r'<HR>\n<font size="\+1">([^<]*)<\/font><BR>', html, re.MULTILINE)
I have a few known formats in an HTML page, I need to parse the content of the tags
<TR>
<TD align=center>Reissue of:</TD>
<TD align=center> **VALUES_TO_FIND** </TD>
<TD> </TD>
</TR>
<TR>
<TD align=center> </TD>
</TR>
basically I thought I can concatenate the HTML with a regular expression that will match anything inside the spot I'm looking for.
I know that the text before and after VALUES_TO_FIND will always be the same. how can I find it using RE? (I'm dealing with several cases and the format can repeat in several places in the page.
This is what you are looking for:
import re
s="""
<TR>
<TD align=center>Reissue of:</TD>
<TD align=center> **VALUES_TO_FIND** </TD>
<TD> </TD>
</TR>
"""
p="""
<TR>
<TD align=center>Reissue of:</TD>
<TD align=center>(.*)</TD>
<TD> </TD>
</TR>
"""
m=re.search(p, s)
print m.group(1)
Don't use regular expression to parse HTML (It's not a regular language).
There are many threads on the topic at stackoverflow.
I recommend you to use: BeautifulSoup, Pattern and similar modules.
There are many better options for getting data out of HTML than regular expressions. Try Scrapy, for example.
HTML isn't a regular language, using regular expression to work with it is difficult.
BeautifulSoup is a nice parser, here's an example how to use it:
from BeautifulSoup import BeautifulSoup
html = u'''
<TR>
<TD align=center>Reissue of:</TD>
<TD align=center> **VALUES_TO_FIND** </TD>
<TD> </TD>
</TR>
<TR>
<TD align=center> </TD>
</TR>'''
bs = BeautifulSoup(html)
print [td.contents for td in bs.findAll('td')]
output:
[[u'Reissue of:'], [u' **VALUES_TO_FIND** '], [u' '], [u' ']]
You know what to do from here. :)
Install with pip install BeautifulSoup. Here are the docs:
http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html
This regular expression will do:
re.findall(r'<TR>\s+<TD.+?</TD>\s+<TD align=center>(.*?)</TD>',html,re.DOTALL)
But I recommend using a parser.
I am parsing an html file and I want to find the part of the file where it says "Smaller Reporting Company" and either has an "X" or Checkbox next to it or doesn't. The checkbox is typically done with the Wingdings font or an ascii code. In the HTML below you'll see it has an þ in wingdings next to it.
I have no problem showing the results of a regular expression search for the text, but I'm having trouble going the next step and looking for a check box.
I will be using this to parse a number of different html files that won't all follow the same format, but most of them will use a table and ascii text like this example.
Here is the HTML code:
<HTML>
<HEAD><TITLE></TITLE></HEAD>
<BODY>
<DIV align="left">Indicate by check mark whether the registrant is a large accelerated filer, an accelerated filer, a non-accelerated filer, or a smaller reporting company. See the definitions of large accelerated filer, accelerated filer and smaller reporting company. (Check one):
</DIV>
<DIV align="center">
<TABLE style="font-size: 10pt" cellspacing="0" border="0" cellpadding="0" width="100%">
<!-- Begin Table Head -->
<TR valign="bottom">
<TD width="22%"> </TD>
<TD width="3%"> </TD>
<TD width="22%"> </TD>
<TD width="3%"> </TD>
<TD width="22%"> </TD>
<TD width="3%"> </TD>
<TD width="22%"> </TD>
</TR>
<TR></TR>
<!-- End Table Head -->
<!-- Begin Table Body -->
<TR valign="bottom">
<TD align="center" valign="top"><FONT style="white-space: nowrap"> Large accelerated filer <FONT style="font-family: Wingdings">o</FONT></FONT>
</TD>
<TD> </TD>
<TD align="center" valign="top"><FONT style="white-space: nowrap">Accelerated filer <FONT style="font-family: Wingdings">o</FONT></FONT>
</TD>
<TD> </TD>
<TD align="center" valign="top"><FONT style="white-space: nowrap"> Non-accelerated filer <FONT style="font-family: Wingdings">o</FONT> </FONT>
<FONT style="white-space: nowrap">(Do not check if a smaller reporting company)</FONT>
</TD>
<TD> </TD>
<TD align="center" valign="top"><FONT style="white-space: nowrap"> Smaller reporting company <FONT style="font-family: Wingdings">þ</FONT></FONT></TD>
</TR>
<!-- End Table Body -->
</TABLE>
</DIV></BODY></HTML>
Here is my Python code:
import os, sys, string, re
from BeautifulSoup import BeautifulSoup
rawDataFile = "testfile1.html"
f = open(rawDataFile)
soup = BeautifulSoup(f)
f.close()
search = soup.findAll(text=re.compile('[sS]maller.*[rR]eporting.*[cC]ompany'))
print search
Question:
How could I set this up to have a second search that is dependent upon the first search? So when I find "smaller reporting company" I can search the next few lines to see if there is an ascii code? I've been going through the soup docs. I tried to do find and findNext but I haven't been able to get it to work.
If you know the position of the wingding character won't change, you can use .next.
>>> nodes = soup.findAll(text=re.compile('[sS]maller.*[rR]eporting.*[cC]ompany'))
>>> nodes[-1].next.next # last item in list is the only good one... kinda crap
u'þ'
Or you can go up, and then find from there:
>>> nodes[-1].parent.find('font',style="font-family: Wingdings").next
u'þ'
Or you could do it the other way round:
>>> soup.findAll(text='þ')[0].previous.previous
u' Smaller reporting company '
This assume that you know the wingding caharcters you're looking for.
The last strategy has the added bonus of filtering out other crap that your regex is catching, which I suppose you don't really want; you can then just cycle through results knowing that you're only working on the right list, so you can peruse if to your liking.
You may try iterating through the structure and checking for values inside the inner tags or checking for values in the outer tags. I can't remember off hand how to do it and I ended up using lxml for this, but I think bsoup may be able to do this.
If you can't get bsoup to do it check out lxml. It is potentially faster depending upon what you are doing. It also has hooks for using bsoup with lxml.
lxml has a tolerant HTML parser. You don't need bsoup (which is now deprecated by its author) and you should avoid regexes for parsing HTML.
Here is a first rough cut at what you are looking for:
guff = """\
<HTML>
<HEAD><TITLE></TITLE></HEAD>
[snip]
</DIV></BODY></HTML>
"""
from lxml.html import fromstring
doc = fromstring(guff)
for td_el in doc.iter('td'):
font_els = list(td_el.iter('font'))
if not font_els: continue
print
for el in font_els:
print (el.text, el.attrib)
This produces:
(' Large accelerated filer ', {'style': 'white-space: nowrap'})
('o', {'style': 'font-family: Wingdings'})
('Accelerated filer ', {'style': 'white-space: nowrap'})
('o', {'style': 'font-family: Wingdings'})
(' Non-accelerated filer ', {'style': 'white-space: nowrap'})
('o', {'style': 'font-family: Wingdings'})
('(Do not check if a smaller reporting company)', {'style': 'white-space: nowrap
'})
(' Smaller reporting company ', {'style': 'white-space: nowrap'})
(u'\xfe', {'style': 'font-family: Wingdings'})