How to create a regex for the following scenario (HTML)? - python

I have a few known formats in an HTML page, and I need to parse the content of the tags:
<TR>
<TD align=center>Reissue of:</TD>
<TD align=center> **VALUES_TO_FIND** </TD>
<TD> </TD>
</TR>
<TR>
<TD align=center> </TD>
</TR>
Basically, I thought I could embed a regular expression in the surrounding HTML so that it matches anything in the spot I'm looking for.
I know that the text before and after VALUES_TO_FIND will always be the same. How can I find it using a regular expression? (I'm dealing with several cases, and the format can repeat in several places on the page.)

This is what you are looking for:
import re
s="""
<TR>
<TD align=center>Reissue of:</TD>
<TD align=center> **VALUES_TO_FIND** </TD>
<TD> </TD>
</TR>
"""
p="""
<TR>
<TD align=center>Reissue of:</TD>
<TD align=center>(.*)</TD>
<TD> </TD>
</TR>
"""
m=re.search(p, s)
print(m.group(1))

Don't use regular expressions to parse HTML (it's not a regular language).
There are many threads on the topic on Stack Overflow.
I recommend using BeautifulSoup, Pattern, or a similar module.

There are many better options for getting data out of HTML than regular expressions. Try Scrapy, for example.

HTML isn't a regular language, so working with it using regular expressions is difficult.
BeautifulSoup is a nice parser; here's an example of how to use it:
from BeautifulSoup import BeautifulSoup
html = u'''
<TR>
<TD align=center>Reissue of:</TD>
<TD align=center> **VALUES_TO_FIND** </TD>
<TD> </TD>
</TR>
<TR>
<TD align=center> </TD>
</TR>'''
bs = BeautifulSoup(html)
print [td.contents for td in bs.findAll('td')]
output:
[[u'Reissue of:'], [u' **VALUES_TO_FIND** '], [u' '], [u' ']]
You know what to do from here. :)
Install with pip install BeautifulSoup. Here are the docs:
http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html
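One way to take that last step, sketched with the newer bs4 package rather than the BeautifulSoup 3 import used above. The helper name values_after_label and the choice of the built-in "html.parser" are assumptions made for illustration. The idea is to find each cell whose text is the known label and read the cell that follows it:

```python
from bs4 import BeautifulSoup

html = """
<TR>
<TD align=center>Reissue of:</TD>
<TD align=center> **VALUES_TO_FIND** </TD>
<TD> </TD>
</TR>
<TR>
<TD align=center> </TD>
</TR>
"""

soup = BeautifulSoup(html, "html.parser")


def values_after_label(soup, label):
    """Return the text of every <td> that directly follows a <td> holding `label`."""
    values = []
    for td in soup.find_all("td"):
        if td.get_text(strip=True) == label:
            following = td.find_next_sibling("td")
            if following is not None:
                values.append(following.get_text(strip=True))
    return values


print(values_after_label(soup, "Reissue of:"))
```

Because the function collects every match, it should also cover the case where the same format repeats in several places on the page.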

This regular expression will do:
re.findall(r'<TR>\s+<TD.+?</TD>\s+<TD align=center>(.*?)</TD>',html,re.DOTALL)
But I recommend using a parser.
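If you do stay with a regular expression, a variant that anchors on the known label may be a little safer than matching any row. This is only a sketch: the function name find_after_label and the sample values are made up for illustration, and it still assumes the markup around the value never changes.

```python
import re

html = """
<TR>
<TD align=center>Reissue of:</TD>
<TD align=center> FIRST_VALUE </TD>
<TD> </TD>
</TR>
<TR>
<TD align=center>Reissue of:</TD>
<TD align=center> SECOND_VALUE </TD>
<TD> </TD>
</TR>
"""


def find_after_label(html, label):
    """Return the contents of each <TD> that follows a <TD> containing `label`."""
    pattern = re.compile(
        r"<TD[^>]*>\s*" + re.escape(label) + r"\s*</TD>\s*<TD[^>]*>(.*?)</TD>",
        re.IGNORECASE | re.DOTALL,
    )
    return [value.strip() for value in pattern.findall(html)]


print(find_after_label(html, "Reissue of:"))
```

re.escape() keeps any special characters in the label from being read as regex syntax, and the non-greedy (.*?) stops at the first closing </TD>.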

Related

Use beautifulSoup to find a table after a header?

I am trying to scrape some data off a website. The data that I want is listed in a table, but there are multiple tables and no ID's. I then had the idea that I would find the header just above the table I was searching for and then use that as an indicator.
This has really troubled me, so as a last resort I wanted to ask whether someone knows how to use BeautifulSoup to find the table.
A snippet of the HTML code is provided below. Thanks in advance :)
The table I am interested in, is the table right beneath <h2>Mine neaste vagter</h2>
<h2>Min aktuelle vagt</h2>
<div>
<a href='/shifts/detail/595212/'>Flere detaljer</a>
<p>Vagt starter: <b>11/06 2021 - 07:00</b></p>
<p>Vagt slutter: <b>11/06 2021 - 11:00</b></p>
<h2>Masker</h2>
<table class='list'>
<tr><th>Type</th><th>Fra</th><th> </th><th>Til</th></tr>
<tr>
<td>Fri egen regningD</td>
<td>07:00</td>
<td> - </td>
<td>11:00</td>
</tr>
</table>
</div>
<hr>
<h2>Mine neaste vagter</h2>
<table class='list'>
<tr>
<th class="alignleft">Dato</th>
<th class="alignleft">Rolle</th>
<th class="alignleft">Tidsrum</th>
<th></th>
<th class="alignleft">Bytte</th>
<th class="alignleft" colspan='2'></th>
</tr>
<tr class="rowA separator">
<td>
<h3>12/6</h3>
</td>
<td>Kundeservice</td>
<td>18:00 → 21:30 (3.5 t)</td>
<td style="max-width: 20em;"></td>
<td>
<a href="/shifts/ajax/popup/595390/" class="swap shiftpop">
Byt denne vagt
</a>
</td>
<td><a href="/shifts/detail/595390/">Detaljer</td>
<td>
</td>
</tr>
Here are two approaches to find the correct <table>:
Since the table you want is the last one in the HTML, you can use find_all() and index with [-1] to get the last table:
print(soup.find_all("table", class_="list")[-1])
Find the h2 element by its text, and then use the find_next() method to find the table:
print(soup.find(lambda tag: tag.name == "h2" and "Mine neaste vagter" in tag.text).find_next("table"))
You can use :-soup-contains (or just :contains) to target the <h2> by its text and then use find_next to move to the table:
from bs4 import BeautifulSoup as bs
html = '''your html'''
soup = bs(html, 'lxml')
soup.select_one('h2:-soup-contains("Mine neaste vagter")').find_next('table')
This is assuming the HTML, as shown, is returned by whatever access method you are using.
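For completeness, here is a self-contained sketch of the heading-then-table idea that also pulls the cell text out of the table it finds. The HTML is a shortened, tidied copy of the snippet from the question, and the use of "html.parser" and the variable names are my own assumptions:

```python
from bs4 import BeautifulSoup

html = """
<h2>Masker</h2>
<table class='list'>
<tr><th>Type</th><th>Fra</th></tr>
<tr><td>Fri egen regning</td><td>07:00</td></tr>
</table>
<hr>
<h2>Mine neaste vagter</h2>
<table class='list'>
<tr><th>Dato</th><th>Rolle</th><th>Tidsrum</th></tr>
<tr><td><h3>12/6</h3></td><td>Kundeservice</td><td>18:00 → 21:30 (3.5 t)</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Locate the heading by its exact text, then take the first table after it.
heading = soup.find("h2", string="Mine neaste vagter")
table = heading.find_next("table") if heading is not None else None

rows = []
if table is not None:
    for tr in table.find_all("tr"):
        # Header and data cells are both collected, in document order.
        rows.append([cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])])

for row in rows:
    print(row)
```

Checking for None before calling find_next() avoids an AttributeError if the heading text ever changes.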

How to extract elements from html with BeautifulSoup

I am beginning to learn python and would like to try to use BeautifulSoup to extract the elements in the below html.
This html is taken from a voice recording system that logs the time and date in local time, UTC, call duration, called number, name, calling number, name, etc
There are usually hundreds of these entries.
What I am attempting to do is extract the elements and print them in one line to a comma delimited format in order to compare with call detail records from call manager. This will help to verify that all calls were recorded and not missed.
I believe BeautifulSoup is the right tool to do this.
Could someone point me in the right direction?
<tbody>
<tr class="formRowLight">
<td class="formRowLight" >24/10/16<br>16:24:47</td>
<td class="formRowLight" >24/10/16 07:24:47</td>
<td class="formRowLight" >00:45</td>
<td class="formRowLight" >31301</td>
<td class="formRowLight" >Joe Smith</td>
<td class="formRowLight" >31111</td>
<td class="formRowLight" >Jane Doe</td>
<td class="formRowLight" >N/A</td>
<td class="formRowLight" >1432875648934</td>
<td align="center" class"formRowLight"> </td>
<tr class="formRowLight">
<td class="formRowLight" >24/10/16<br>17:33:02</td>
<td class="formRowLight" >24/10/16 08:33:02</td>
<td class="formRowLight" >00:58</td>
<td class="formRowLight" >35664</td>
<td class="formRowLight" >Billy Bob</td>
<td class="formRowLight" >227045665</td>
<td class="formRowLight" >James Dean</td>
<td class="formRowLight" >N/A</td>
<td class="formRowLight" >9934959586849</td>
<td align="center" class"formRowLight"> </td>
</tr>
</tbody>
The pandas.read_html() would make things much easier - it would convert your tabular data from the HTML table into a dataframe which, if needed, you can later dump into CSV.
Here is a sample code to get you started:
import pandas as pd
data = """
<table>
<thead>
<tr>
<th>Date</th>
<th>Name</th>
<th>ID</th>
</tr>
</thead>
<tbody>
<tr class="formRowLight">
<td class="formRowLight">24/10/16<br>16:24:47</td>
<td class="formRowLight">Joe Smith</td>
<td class="formRowLight">1432875648934</td>
</tr>
<tr class="formRowLight">
<td class="formRowLight">24/10/16<br>17:33:02</td>
<td class="formRowLight">Billy Bob</td>
<td class="formRowLight">9934959586849</td>
</tr>
</tbody>
</table>"""
df = pd.read_html(data)[0]
print(df.to_csv(index=False))
Prints:
Date,Name,ID
24/10/1616:24:47,Joe Smith,1432875648934
24/10/1617:33:02,Billy Bob,9934959586849
FYI, read_html() relies on an HTML parser under the hood: by default it tries lxml first and falls back to BeautifulSoup (with html5lib) if that fails.
import BeautifulSoup
import urllib2
# 'your_url' is a placeholder for the page that holds the table
request = urllib2.Request('your_url')
response = urllib2.urlopen(request)
soup = BeautifulSoup.BeautifulSoup(response)
mylist = []
# every row of interest carries the class "formRowLight"
rows = soup.findAll('tr', {"class": "formRowLight"})
for row in rows:
    # first matching <td> that follows the start of this row
    text = row.findNext('td', {"class": "formRowLight"}).text
    mylist.append(text)
print mylist
But you need to edit this code a little to prevent any duplicated content.
Yes, BeautifulSoup is a good tool to reach for in this problem. Something to get you started would be as follows:
from bs4 import BeautifulSoup
with open("my_log.html") as log_file:
    html = log_file.read()
soup = BeautifulSoup(html)
# normally you specify a parser too, for example `(html, 'lxml')`
# without specifying a parser, it will warn you and select one automatically
table_rows = soup.find_all("tr")  # get list of all <tr> tags
for row in table_rows:
    table_cells = row.find_all("td")  # get list of all <td> tags in row
    joined_text = ",".join(cell.get_text() for cell in table_cells)
    print(joined_text)
However, pandas's read_html may make this a bit more seamless, as mentioned in another answer to this question. Arguably pandas may be a better hammer to hit this nail with, but learning to use BeautifulSoup for this will also give you the skills to scrape all kinds of HTML in the future.
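A possible refinement of the loop above, offered as a sketch rather than the definitive approach: passing a separator to get_text() keeps the date and time on either side of the <br> apart, and the csv module takes care of quoting if a name ever contains a comma. The sample rows are a trimmed, well-formed copy of the question's HTML, and "html.parser" is my choice of parser:

```python
import csv
import io

from bs4 import BeautifulSoup

html = """
<table><tbody>
<tr class="formRowLight">
<td class="formRowLight">24/10/16<br>16:24:47</td>
<td class="formRowLight">00:45</td>
<td class="formRowLight">Joe Smith</td>
</tr>
<tr class="formRowLight">
<td class="formRowLight">24/10/16<br>17:33:02</td>
<td class="formRowLight">00:58</td>
<td class="formRowLight">Billy Bob</td>
</tr>
</tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")

buffer = io.StringIO()
writer = csv.writer(buffer)
for row in soup.find_all("tr", class_="formRowLight"):
    # " " is inserted between text fragments, e.g. around the <br>.
    cells = [td.get_text(" ", strip=True) for td in row.find_all("td")]
    writer.writerow(cells)

csv_text = buffer.getvalue()
print(csv_text)
```

Note that the HTML in the question leaves the first <tr> unclosed; depending on the parser, that can nest the second row inside the first, so it is worth checking the row count against what you expect.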
First, build html_list, the list of strings that identify the elements you want; the question "Convert BeautifulSoup4 HTML Table to a list of lists, iterating over each Tag elements" shows how.
Then loop over that list and print the text of the first match for each entry:
for element in html_list:
    output = soup.select(element)[0].text
    print("%s ," % output)
This prints the value of each element you are after.
Hope that helps!

&pound; displaying in urllib2 and Beautiful Soup

I'm trying to write a small web scraper in python, and I think I've run into an encoding issue. I'm trying to scrape http://www.resident-music.com/tickets (specifically the table on the page) - a row might look something like this -
<tr>
<td style="width:64.9%;height:11px;">
<p><strong>the great escape 2017 local early bird tickets, selling fast</strong></p>
</td>
<td style="width:13.1%;height:11px;">
<p><strong>18<sup>th</sup>– 20<sup>th</sup> may</strong></p>
</td>
<td style="width:15.42%;height:11px;">
<p><strong>various</strong></p>
</td>
<td style="width:6.58%;height:11px;">
<p><strong>&pound;55.00</strong></p>
</td>
</tr>
I'm essentially trying to replace the &pound;55.00 with £55, and any other 'non-text' nasties.
I've tried a few different encoding things you can go with beautifulsoup, and urllib2 - to no avail, I think I'm just doing it all wrong.
Thanks
You want to unescape the html which you can do using html.unescape in python3:
In [14]: from html import unescape
In [15]: h = """<tr>
....: <td style="width:64.9%;height:11px;">
....: <p><strong>the great escape 2017 local early bird tickets, selling fast</strong></p>
....: </td>
....: <td style="width:13.1%;height:11px;">
....: <p><strong>18<sup>th</sup>– 20<sup>th</sup> may</strong></p>
....: </td>
....: <td style="width:15.42%;height:11px;">
....: <p><strong>various</strong></p>
....: </td>
....: <td style="width:6.58%;height:11px;">
....: <p><strong>&pound;55.00</strong></p>
....: </td>
....: </tr>"""
In [16]:
In [16]: print(unescape(h))
<tr>
<td style="width:64.9%;height:11px;">
<p><strong>the great escape 2017  local early bird tickets, selling fast</strong></p>
</td>
<td style="width:13.1%;height:11px;">
<p><strong>18<sup>th</sup>– 20<sup>th</sup> may</strong></p>
</td>
<td style="width:15.42%;height:11px;">
<p><strong>various</strong></p>
</td>
<td style="width:6.58%;height:11px;">
<p><strong>£55.00</strong></p>
</td>
</tr>
For python2 use:
In [6]: from HTMLParser import HTMLParser
In [7]: unescape = HTMLParser().unescape
In [8]: print(unescape(h))
<tr>
<td style="width:64.9%;height:11px;">
<p><strong>the great escape 2017  local early bird tickets, selling fast</strong></p>
</td>
<td style="width:13.1%;height:11px;">
<p><strong>18<sup>th</sup>– 20<sup>th</sup> may</strong></p>
</td>
<td style="width:15.42%;height:11px;">
<p><strong>various</strong></p>
</td>
<td style="width:6.58%;height:11px;">
<p><strong>£55.00</strong></p>
</td>
You can see both correctly unescape all entities not just the pound sign.
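That claim is easy to check on a few isolated entities. A minimal sketch for Python 3.4+, where the sample strings are my own and not taken from the page being scraped:

```python
from html import unescape

# Each key is escaped HTML text; each value is what unescape() should return.
samples = {
    "&pound;55.00": "\u00a355.00",       # named entity for the pound sign
    "18&ndash;20 may": "18\u201320 may",  # en dash
    "a&nbsp;b": "a\u00a0b",              # non-breaking space
    "&#163;71.50": "\u00a371.50",        # numeric character reference
}

for raw, expected in samples.items():
    result = unescape(raw)
    assert result == expected
    print(repr(raw), "->", repr(result))
```

Named entities and numeric references go through the same call, so there is no need to list the characters you expect to meet.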
I used requests for this, but you should be able to do the same with urllib2. Here is the code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import requests
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(requests.get('your_url').text)
chart = soup.findAll(name='tr')
print str(chart).replace('&pound;', unichr(163))  # replace '&pound;' with '£'
Now you should get the expected output!
Sample output:
...
<strong>£71.50</strong></p>
...
There are many ways to do the parsing; the interesting part here is print str(chart).replace('&pound;', unichr(163)), which was quite challenging :)
Update
If you want to unescape more than one kind of character (dashes, pounds, etc.), or even just one, it is easier and more efficient to use a parser as in Padraic's answer. As the comments also point out, that approach sometimes handles other encoding issues as well.

Search in HTML page using Regex patterns with python

I'm trying to find a string inside a HTML page with known patterns.
for example, in the following HTML code:
<TABLE WIDTH="100%">
<TR><TD ALIGN="LEFT" width="50%"> </TD>
<TD ALIGN=RIGHT VALIGN=BOTTOM WIDTH=50%><FONT SIZE=-1>( <STRONG>1</STRONG></FONT> <FONT SIZE=-2>of</FONT> <STRONG><FONT SIZE=-1>1</STRONG> )</FONT></TD></TR></TABLE>
<HR>
<TABLE WIDTH="100%">
<TR> <TD ALIGN="LEFT" WIDTH="50%"><B>String 1</B></TD>
<TD ALIGN="RIGHT" WIDTH="50%"><B><A Name=h1 HREF=#h0></A><A HREF=#h2></A><B><I></I></B>String</B></TD>
</TR>
<TR><TD ALIGN="LEFT" WIDTH="50%"><b>String 2.</B>
</TD>
<TD ALIGN="RIGHT" WIDTH="50%"> <B>
String 3
</B></TD>
</TR>
</TABLE>
<HR>
<font size="+1">String 4</font><BR>
...
I want to find String 4 , and I know that it will always be between
<HR><font size="+1">
and </font><BR>
how can I search for the string using RE?
edit:
I've tried the following, but no success:
p = re.match('<HR><font size="+1">(.*?)</font><BR>',html)
thanks.
re.findall(r'<HR>\s*<font size="\+1">(.*?)</font><BR>', html, re.DOTALL)
findall returns a list of everything captured between the parentheses in the regular expression. I used re.DOTALL so that the dot also matches line breaks.
I used \s* because I was not sure whether there would be any whitespace.
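It may also help to see why the attempt in the question returns nothing. As far as I can tell there are two separate reasons: re.match only matches at the very start of the string, and an unescaped + is a quantifier rather than a literal plus sign. A small sketch on a shortened copy of the question's HTML (the variable names are mine):

```python
import re

html = '<TABLE WIDTH="100%"></TABLE>\n<HR>\n<font size="+1">String 4</font><BR>\n'

# The attempt from the question: anchored at position 0 and with an unescaped "+".
attempt = re.match('<HR><font size="+1">(.*?)</font><BR>', html)
print(attempt)

# Searching anywhere is not enough while "+" still means "one or more quotes".
still_unescaped = re.search(r'<HR>\s*<font size="+1">(.*?)</font><BR>', html)
print(still_unescaped)

# Escaping the plus and allowing whitespace after <HR> finds the text.
fixed = re.search(r'<HR>\s*<font size="\+1">(.*?)</font><BR>', html, re.DOTALL)
print(fixed.group(1))
```

The first two calls return None, and the third captures the text between the tags.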
This works, but may not be very robust:
import re
r = re.compile(r'<HR>\s?<font size="\+1">(.+?)</font>\s?<BR>', re.IGNORECASE)
r.findall(html)
You will be better off using a proper HTML parser. BeautifulSoup is excellent and easy to use. Look it up.
re.findall(r'<HR>\n<font size="\+1">([^<]*)<\/font><BR>', html, re.MULTILINE)

Regex returning nothing in Python

I'm working in Python for the first time and I've used Mechanize to search a website along with BeautifulSoup to select a particular div, now I'm trying to grab a specific sentence with a regular expression. This is the soup object's contents;
<div id="results">
<table cellspacing="0" width="100%">
<tr>
<th align="left" valign="middle" width="32%">Physician Name, (CPSO#)</th>
<th align="left" valign="middle" width="36%">Primary Practice Location</th>
<!-- <th width="16%" align="center" valign="middle">Accepting New Patients?</th> -->
<th align="center" valign="middle" width="32%">Disciplinary Info & Restrictions</th>
</tr>
<tr>
<td>
<a class="doctor" href="details.aspx?view=1&id= 85956">Hull, Christopher Merritt </a> (#85956)
</td>
<td>Four Counties Medical Clinic<br/>1824 Concessions Dr<br/>Newbury ON N0L 1Z0<br/>Phone: (519) 693-0350<br/>Fax: (519) 693-0083</td>
<!-- <td></td> -->
<td align="center"></td>
</tr>
</table>
</div>
(Thank you for the assistance with formatting)
My regular expression to get the text "Hull, Christopher Merritt" is;
patFinderName = re.compile('<a class="doctor" href="details.aspx?view=1&id= 85956">(.*) </a>')
It keeps returning empty and I can't figure out why, anybody have any ideas?
Thank you for the answers, I've changed it to;
patFinderName = re.compile('<a class="doctor" href=".*">(.*) </a>')
Now it works beautifully.
? is a magic token in regular expressions, meaning zero or one of the previous atom. As you want a literal question mark symbol, you need to escape it.
You should escape the ? in your regex:
In [8]: re.findall('<a class="doctor" href="details.aspx\?view=1&id= 85956">(.*)</a>', text)
Out[8]: ['Hull, Christopher Merritt ']
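If you would rather not escape special characters by hand, re.escape() can do it for the literal part of the pattern. A minimal sketch, assuming the href is known in advance as in the question; the variable names are mine:

```python
import re

text = ('<a class="doctor" href="details.aspx?view=1&id= 85956">'
        'Hull, Christopher Merritt </a> (#85956)')

href = 'details.aspx?view=1&id= 85956'

# Without escaping, "x?" means "an optional x", so the literal "?" never matches.
unescaped = re.findall('<a class="doctor" href="' + href + '">(.*)</a>', text)
print(unescaped)

# re.escape() turns every special character in href into a literal.
pattern = re.compile('<a class="doctor" href="' + re.escape(href) + r'">(.*?)\s*</a>')
match = pattern.search(text)
name = match.group(1) if match else None
print(name)
```

The \s* before </a> drops the trailing space that the original pattern captured.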
