Problem extracting text out of html file using python regex

I'm working on a project that requires me to write some code to pull some text out of an HTML file in Python.
<tr>
<td>Target binary file name:</td>
<td class="right">Doc1.docx</td>
</tr>
^Small portion of the html file that I'm interested in.
#! /usr/bin/python
import os
import re

if __name__ == '__main__':
    f = open('./results/sample_result.html')
    soup = f.read()
    p = re.compile("binary")
    for line in soup:
        m = p.search(line)
        if m:
            print "finally"
            break
^Sample code I wrote to test if I could extract data out.
I've written several similar programs to extract text from txt files, and they have worked just fine. Is there something I'm missing with regards to regex and HTML?

Is there something I'm missing with regards to regex and HTML?
Yes. You're missing the fact that some HTML cannot be parsed with a simple regex.

Is this actually what you're trying to do, or just a simple example for a more complicated regex later? If the latter, listen to everyone else. If the former:
for line in file:
    if "binary" in line:
        # do stuff
If that doesn't work, are you sure "binary" is in the file? Not, I don't know, "<i>b</i>inary"?
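For completeness, here is a minimal self-contained version of that fix (Python 3 print syntax). Note the actual bug in the question's script: f.read() returns one big string, so "for line in soup" iterates over single characters, and a one-character string can never contain "binary". Iterating over the file object itself yields lines:

# Iterate over the file's lines, not the characters of one big string.
with open('./results/sample_result.html') as f:
    for line in f:
        if "binary" in line:
            print("finally")
            break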

HTML as understood by browsers is waaaay too flexible for regular expressions. Attributes can pop up in any tag, in any order, in upper or lower case, and with or without quotation marks around the value. Special emphasis tags can show up anywhere. Whitespace is significant in a regex, but not so much in HTML, so your regex has to be littered with \s*'s everywhere. There is no requirement that opening tags be matched with closing tags. Some opening tags include a trailing '/', meaning that they are empty tags (no body, no closing tag). Lastly, HTML is often nested, which is pretty much off the charts as far as regex is concerned.
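That said, pulling the value out of the question's snippet takes only a few lines with a parser. A sketch with BeautifulSoup (assuming bs4 is installed):

from bs4 import BeautifulSoup

html = '''<tr>
<td>Target binary file name:</td>
<td class="right">Doc1.docx</td>
</tr>'''

soup = BeautifulSoup(html, "html.parser")
label = soup.find("td", string="Target binary file name:")  # the label cell
print(label.find_next_sibling("td").get_text())              # -> Doc1.docx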

Related

Finding a random sentence in HTML with python regex

I'm trying to write a small function for another script that pulls the generated text from "http://subfusion.net/cgi-bin/quote.pl?quote=humorists&number=1"
Essentially, I need it to pull whatever sentence is between <br> tags.
I've been trying my darndest using regular expressions, but I never really could get the hang of those.
All of the searching I did turned up things for pulling either specific sentences, or single words.
This, however, needs to pull whatever arbitrary string is between <br> tags.
Can anyone help me out? Thanks.
Best I could come up with:
html = urlopen("http://subfusion.net/cgi-bin/quote.pl?quote=humorists&number=1").read()
output = re.findall('\<br>.*\<br>', html)
EDIT: Ended up going with a different approach altogether: simply splitting the HTML into a list separated by <br> and pulling [3]. That made for cleaner code and fewer string operations. Keeping this question up for future reference and other people with similar questions.
You need to use the DOTALL flag, as there are newlines in the text that you need to match. I would use
re.findall('<br>(.*?)<br>', html, re.S)
However, this will return multiple results, as there are a bunch of <br><br> pairs on that page. You may want to use the more specific:
re.findall('<hr><br>(.*?)<br><hr>', html, re.S)
from urllib import urlopen
import re

html = urlopen("http://subfusion.net/cgi-bin/quote.pl?quote=humorists&number=1").read()
output = re.findall('<body>.*?>\n*([^<]{5,})<.*?</body>', html, re.S)
if len(output) > 0:
    print(output)
    output = re.sub('\n', ' ', output[0])  # collapse newlines to spaces
    output = re.sub('\t', '', output)      # drop the indentation tabs
    print(output)
Terminal
imac2011:Desktop allendar$ python test.py
['A black cat crossing your path signifies that the animal is going somewhere.\n\t\t-- Groucho Marx\n\n']
A black cat crossing your path signifies that the animal is going somewhere. -- Groucho Marx
You could also strip off the final \n's and replace the ones inside the text (on longer quotes) with <br /> if you are displaying it in HTML again, so you would maintain the original line breaks visually.
All jokes on that page follow the same model, with nothing ambiguous, so you can use this:
output = re.findall('(?<=<br>\s)[^<]+(?=\s{2}<br)', html)
No need to use the DOTALL flag because there's no dot.
This is, uh, 7 years later, but for future reference:
Use the BeautifulSoup library for this kind of purpose, as suggested by Floris in the comments.
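For reference, a BeautifulSoup version of the same task (a sketch assuming Python 3 and bs4; the quote site may no longer serve the same markup):

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://subfusion.net/cgi-bin/quote.pl?quote=humorists&number=1").read()
soup = BeautifulSoup(html, "html.parser")
# get_text() drops every tag, including the <br>s around the quote.
print(soup.get_text().strip())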

Regex for extracting all regular text from html in python [duplicate]

This question already has answers here:
regular expression to extract text from HTML
(11 answers)
Closed 10 years ago.
How do I extract everything that is not an HTML tag from a partial HTML text?
That is, if I have something of the type:
<div>Hello</div><h3><div>world</div></h3>
I want to extract ['Hello','world']
I thought about the Regex:
>[a-zA-Z0-9]+<
but it will not include special characters or Chinese or Hebrew characters, which I need.
You should look at something like regular expression to extract text from HTML
From that post:
You can't really parse HTML with regular expressions. It's too
complex. RE's won't handle HTML that will work in
a browser as proper text, but might baffle a naive RE.
You'll be happier and more successful with a proper HTML parser.
Python folks often use something like Beautiful Soup to parse HTML and
strip out tags and scripts.
Also, browsers, by design, tolerate malformed HTML. So you will often
find yourself trying to parse HTML which is clearly improper, but
happens to work okay in a browser.
You might be able to parse bad HTML with RE's. All it requires is
patience and hard work. But it's often simpler to use someone else's
parser.
As Avi already pointed out, this is too complex a task for regular expressions. Use get_text from BeautifulSoup or clean_html from nltk to extract text from your HTML.
from bs4 import BeautifulSoup
clean_text = BeautifulSoup(html).get_text()
or
import nltk
clean_text = nltk.clean_html(html)
Another option, thanks to GuillaumeA, is to use pyquery:
from pyquery import PyQuery
clean_text = PyQuery(html).text()  # .text() extracts the text content
It must be said that the above-mentioned HTML parsers will do the job with varying levels of success if the HTML is not well formed, so you should experiment and see what works best for your input data.
I am not familiar with Python, but the following regular expression can help you.
<\s*(\w+)[^/>]*>
where,
<: the starting character
\s*: there may be whitespace before the tag name (ugly but possible).
(\w+): tag names can contain letters and numbers (h1). Well, \w also matches '_', but it does not hurt, I guess. If curious, use ([a-zA-Z0-9]+) instead.
[^/>]*: anything except / and > until the closing >
>: the closing >
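If you do stay with a regex despite the caveats above, inverting the character class sidesteps the question's Unicode problem: capture anything that is not a < between a closing > and the next opening <. A minimal sketch:

import re

html = '<div>Hello</div><h3><div>world</div></h3>'
print(re.findall(r'>([^<]+)<', html))  # -> ['Hello', 'world']

Since [^<] matches any character except <, Chinese or Hebrew text between the tags is captured as well.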

Badly named links search and replace

The problem I'm facing is badly named links...
There are a few hundred bad links across different files.
So I want to write a script to replace links like
<a href="../../../external.html?link=http://www.twitter.com">
<a href="../../external.html?link=http://www.facebook.com/pages/somepage/">
<a href="../external.html?link=http://www.tumblr.com/">
to direct links like
<a href="http://www.twitter.com">
I know we have the pattern ../ repeating one or more times, and the external.html?link= part should also be removed.
How would you recommend doing this? awk, sed, maybe Python?
Will I need a regex?
Thanks for your opinions...
This could be a place where regular expressions are the correct solution. You are only searching for text in attributes, and the contents are regular, fitting a pattern.
The following python regular expression would locate these links for you:
r'href="((?:\.\./)+external\.html\?link=)([^"]+)"'
The pattern we look for is something inside a href="" chunk of text, where that 'something' starts with one or more instances of ../, followed by external.html?link=, then followed with any text that does not contain a " quote.
The matched text after the equals sign is grouped in group 2 for easy retrieval, group 1 holds the ../../external.html?link= part.
If all you want to do is remove the ../../external.html?link= part altogether (so the links point directly to the endpoint instead of going via the redirect page), leave off the first group and do a simple .sub() on your HTML files:
import re
redirects = re.compile(r'href="(?:\.\./)+external\.html\?link=([^"]+)"')
# ...
redirects.sub(r'href="\1"', somehtmlstring)
Note that this could also match body text (text outside HTML tags); this is not an HTML-aware solution. Chances are there is no such body text, though. But if there is, you'll need a full-blown HTML parser like BeautifulSoup or lxml instead.
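For illustration, here is that substitution applied to one of the question's sample links (a sketch; reading and rewriting the actual files is left out):

import re

redirects = re.compile(r'href="(?:\.\./)+external\.html\?link=([^"]+)"')
sample = '<a href="../../../external.html?link=http://www.twitter.com">'
print(redirects.sub(r'href="\1"', sample))
# -> <a href="http://www.twitter.com">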
Use an HTML parser like BeautifulSoup or lxml.html.

strip only html anchor tags

I have the following code that strips all tags. Now I want to strip only anchor tags.
x = re.compile(r'<[^<]*?/?>')
How do I modify it so that only anchor tags are stripped?
following code that strips all tags.
Not really. <div title="a>b"> is valid HTML and gets mangled. <div title="<" onmouseover="script()" class="<">"> is invalid HTML but the kind of thing you will often find on real web pages. Your regexp leaves an active tag with dangerous scripting in it.
You can't do an HTML-processing task like tag-stripping with regex, unless your possible input set is heavily restricted. Better to use a real HTML parser and walk across the resulting document removing unwanted elements as you go.
e.g. with BeautifulSoup:
def replaceWithContents(element):
    # Move the element's children up to its parent, then remove the element.
    ix = element.parent.contents.index(element)
    for child in reversed(element.contents):
        element.parent.insert(ix, child)
    element.extract()

doc = BeautifulSoup(html)  # maybe fromEncoding='utf-8'
for link in doc.findAll('a'):
    replaceWithContents(link)
str(doc)
x = re.compile(r'<[aA]\b[^<]*?/?>')
This will match 'a' or 'A' followed by a word boundary (\b). Note that it won't clean out the closing tag.
x = re.compile(r'</?[aA]\b[^<]*?/?>')
will remove the closing tag as well.
EDIT:
Actually, it feels more reliable to switch the [^<] to [^>], like so.
x = re.compile(r'</?[aA]\b[^>]*?/?>')
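Applied to a small sample (a sketch; the URL and text are made up for illustration), this strips the anchor tags but keeps the link text:

import re

x = re.compile(r'</?[aA]\b[^>]*?/?>')
html = '<p>See <a href="http://example.com">this page</a> for details.</p>'
print(x.sub('', html))  # -> <p>See this page for details.</p>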
I'm not sure if this Python is correct (I'm a PHP guy, but am just starting to learn Python in my own time).
re.sub(r'<[aA][^>]*>([^<]+)</[aA]>', r'\1', '<html><head> .... </body></html>')
This won't remove all anchor tags in one shot, so you may have to loop over the HTML string. It matches the anchor tags and replaces each match with the contents of the tags. So ...
<a href="...">homepage</a> -> homepage
Might not be the most efficient on a large body of text, but it works.

skip over HTML tags in Regular Expression patterns

I'm trying to write a regular expression pattern (in python) for reformatting these template engine files.
Basically the scheme looks like this:
[$$price$$]
{
    <h3 class="price">
        $12.99
    </h3>
}
I'm trying to make it remove any extra tabs/spaces/newlines, so it should look like this:
[$$price$$]{<h3 class="price">$12.99</h3>}
I wrote this: (\t|\s)+? which works, except that it also matches within the HTML tags, so <h3 class= becomes <h3class=, and I am unable to figure out how to make it ignore anything inside the tags.
Using regular expressions to deal with HTML is extremely error-prone; they're simply not the right tool.
Instead, use an HTML/XML-aware library (such as lxml) to build a DOM-style object tree; modify the text segments within the tree in place, and generate your output again using said library.
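A sketch of that approach with lxml (assuming lxml is installed; fragment_fromstring parses the HTML chunk, and .text/.tail hold the text segments inside and after each node):

import re
from lxml import html

fragment = html.fragment_fromstring('<h3 class="price">\n\t$12.99\n</h3>')
for node in fragment.iter():
    # Collapse the whitespace runs; for prose you would use ' ' instead of ''.
    if node.text:
        node.text = re.sub(r'\s+', '', node.text)
    if node.tail:
        node.tail = re.sub(r'\s+', '', node.tail)
print(html.tostring(fragment, encoding='unicode'))
# -> <h3 class="price">$12.99</h3>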
Try this:
\r?\n[ \t]*
EDIT: The idea is to remove all newlines (either Unix: "\n", or Windows: "\r\n") plus any horizontal whitespace (tabs or spaces) that immediately follows them.
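A quick check of that pattern against the question's template (a sketch; the string literal stands in for the file contents):

import re

template = '[$$price$$]\n{\n\t<h3 class="price">\n\t\t$12.99\n\t</h3>\n}'
print(re.sub(r'\r?\n[ \t]*', '', template))
# -> [$$price$$]{<h3 class="price">$12.99</h3>}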
Alan,
I have to agree with Charles that the safest way is to parse the HTML, then work on the text nodes only. It sounds like overkill, but it's the safest approach.
On the other hand, there is a way in regex to do that as long as you trust that the HTML code is correct (i.e. does not include invalid < and > in the tags as in: <a title="<this is a test>" href="look here">...)
Then you know that any text has to be between > and <, except at the very beginning and end (if you just get a snippet of the page; otherwise there is at least the <html> tag).
So... you still need two regexes: find the text with '>[^<]+<', then apply the other regex as you mentioned.
The other way is to use an alternation, with something like this (not tested!):
'(<[^>]*>)|([\r\n\t\f ]+)'
This will match either a tag or a run of whitespace. When you find a tag, do not replace it; when you find whitespace instead, replace it with an empty string.
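A sketch of that tag-or-whitespace alternation in Python (the function name is made up; group 1 holds the tag, group 2 the whitespace run):

import re

def strip_whitespace_outside_tags(text):
    # Keep whatever matched as a tag; drop whitespace matched outside tags.
    return re.sub(r'(<[^>]*>)|([\r\n\t\f ]+)',
                  lambda m: m.group(1) or '', text)

template = '[$$price$$]\n{\n\t<h3 class="price">\n\t\t$12.99\n\t</h3>\n}'
print(strip_whitespace_outside_tags(template))
# -> [$$price$$]{<h3 class="price">$12.99</h3>}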
