need to selectively escape html entities (&) - python

I'm scraping an HTML page, then using xml.dom.minidom.parseString() to create a DOM object.
However, the HTML page contains a '&'. I can use cgi.escape to convert this into &amp;, but it also converts all my HTML <> tags into &lt;&gt;, which makes parseString() unhappy.
how do i go about this? i would rather not just hack it and straight replace the "&"s
thanks

For scraping, try to use a library that can handle such HTML "tag soup", like lxml, which has an HTML parser (as well as a dedicated HTML package in lxml.html), or BeautifulSoup. Besides coping with ill-formed documents, these libraries include other things that make scraping and working with HTML easier: getting information out of forms, making hyperlinks absolute, using CSS selectors...

i would rather not just hack it and
straight replace the "&"s
Er, why? That's what cgi.escape is doing - effectively just a search and replace operation for certain characters that have to be escaped.
If you only want to replace a single character, just replace the single character:
yourstring.replace('&', '&amp;')
Don't beat around the bush.

If you want to make sure that you don't accidentally re-escape an already escaped & (i.e. not transform &amp; into &amp;amp; or &szlig; into &amp;szlig;), you could
import re
newstring = re.sub(r"&(?![A-Za-z])", "&amp;", oldstring)
This will leave &s alone when they are followed by a letter.
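One limitation of that lookahead: it also skips a bare & that merely happens to precede a letter, so "A&M" stays unescaped. A slightly stricter sketch, assuming that only complete entities ending in a semicolon should be left alone (escape_bare_ampersands is a hypothetical helper name, not a library function):

```python
import re

# "&" is left alone only when it starts a complete entity:
# a named one (&amp;), a decimal one (&#38;) or a hex one (&#x26;).
_BARE_AMP = re.compile(
    r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)"
)

def escape_bare_ampersands(text):
    """Escape every '&' that is not already part of an entity."""
    return _BARE_AMP.sub("&amp;", text)

print(escape_bare_ampersands("<b>Texas A&M</b> &amp; &#38; &szlig;"))
# <b>Texas A&amp;M</b> &amp; &#38; &szlig;
```

This still cannot tell an intended entity from literal text that looks like one, which is one more reason to prefer a real HTML parser when the input is not under your control.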

You shouldn't use an XML parser to parse data that isn't XML. Find an HTML parser instead, you'll be happier in the long run. The standard library has a few (HTMLParser and htmllib), and BeautifulSoup is a well-loved third-party package.
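To see why an XML parser is a poor fit here, a small sketch using only the standard library (parses_as_xml is a hypothetical helper, and the sample strings are made up for illustration):

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

def parses_as_xml(markup):
    """Return True if minidom accepts the markup as well-formed XML."""
    try:
        minidom.parseString(markup)
    except ExpatError:
        return False
    return True

print(parses_as_xml("<p>Tom & Jerry</p>"))      # False: bare ampersand
print(parses_as_xml("<p>Tom &amp; Jerry</p>"))  # True
print(parses_as_xml("<p>one<br>two</p>"))       # False: <br> is never closed
```

The last case is ordinary HTML that every browser accepts, and no amount of ampersand escaping will make it parse as XML.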

Related

Regular Expression - HTML

I am kind of new to regular expressions, and the one I made myself doesn't work. It is supposed to give me data from a website's HTML.
I basically want to get this out of the HTML, along with every other occurrence of it. I have the page URL as a string, by the way.
<a href="http://store.steampowered.com/search/?category2=2" class="name">Co-Op</a>
And what i've done for my regexp is:
<a\bhref="http://store.steampowered.com/search/?category2=2"\bclass="name"*>(.*?)</a>\g
You should never parse HTML/XML or any other language that allows cascading using regular expressions.
A nice thing with HTML however, is that it can be converted to XML and XML has a nice toolkit for parsing:
echo '<a href="http://store.steampowered.com/search/?category2=2" class="name">Co-Op</a>' | tidy -asxhtml -numeric 2> /dev/null | xmllint --html --xpath 'normalize-space(//a[@class="name" and @href="http://store.steampowered.com/search/?category2=2"])' - 2>/dev/null
With query:
normalize-space(//a[@class="name" and @href="http://store.steampowered.com/search/?category2=2"])
// means any tag (regardless of its depth), a means the a tag, and the predicate in square brackets adds the constraints that class is "name" and href is the given link. normalize-space then returns the text between the matching <a> and </a>, with surrounding whitespace trimmed.
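The same query can be approximated in Python with the standard library's xml.etree.ElementTree. It supports only a subset of XPath (no "and", no normalize-space()), so in this sketch the two constraints become chained predicates and the trimming is done with str.strip(). The sample markup is an assumption reconstructed from the attributes named in the query, and it must already be well-formed XML:

```python
import xml.etree.ElementTree as ET

URL = "http://store.steampowered.com/search/?category2=2"
# Hypothetical, well-formed sample input.
SAMPLE = (
    '<div>'
    '<a href="%s" class="name"> Co-Op </a>'
    '<a href="/other" class="name">Other</a>'
    '</div>' % URL
)

root = ET.fromstring(SAMPLE)
# .// searches at any depth; each [...] predicate adds one constraint.
query = ".//a[@class='name'][@href='%s']" % URL
names = [a.text.strip() for a in root.findall(query)]
print(names)  # ['Co-Op']
```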
In Python you can use:
from urllib.request import urlopen  # Python 3; the original answer used urllib2
from bs4 import BeautifulSoup
page = urlopen("http://store.steampowered.com/app/24860/").read()
soup = BeautifulSoup(page, "html.parser")
print(soup.find_all('a', attrs={'class': 'name', 'href': 'http://store.steampowered.com/search/?category2=2'}))
Comment on your regex:
The problem is that it contains tokens like ? and . that are interpreted as regex directives rather than as literal characters. You need to escape them. It should probably read:
<a\s+href="http://store\.steampowered\.com/search/\?category2=2"\s+class="name"\S*>(.*?)</a>\g
I also replaced \b with \s+: \b only marks a word boundary, whereas \s matches whitespace characters such as space, tab and newline. The regex is still quite fragile: if someone ever decides to swap href and class, the program has a problem. Most of these problems have solutions, but you are better off using an XML analysis tool.
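A short sketch of both points in Python: re.escape takes care of the special characters in the URL, and swapping the attribute order is enough to defeat the pattern. The two sample strings are assumptions based on the link described in the question, and [^>]* is used here in place of \S* before the closing >:

```python
import re

URL = "http://store.steampowered.com/search/?category2=2"
# re.escape backslash-escapes "?", "." and any other special character.
LINK = re.compile(
    r'<a\s+href="' + re.escape(URL) + r'"\s+class="name"[^>]*>(.*?)</a>'
)

as_written = '<a href="%s" class="name">Co-Op</a>' % URL
swapped = '<a class="name" href="%s">Co-Op</a>' % URL

print(LINK.findall(as_written))  # ['Co-Op']
print(LINK.findall(swapped))     # []
```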

Regex for extracting all regular text from html in python [duplicate]

This question already has answers here:
regular expression to extract text from HTML
(11 answers)
Closed 10 years ago.
How do I extract everything that is not an HTML tag from a partial HTML text?
That is, if I have something of the type:
<div>Hello</div><h3><div>world</div></h3>
I want to extract ['Hello','world']
I thought about the Regex:
>[a-zA-Z0-9]+<
but it will not include special characters and chinese or hebrew characters, which I need
You should look at something like regular expression to extract text from HTML
From that post:
You can't really parse HTML with regular expressions. It's too
complex. RE's won't handle <![CDATA[ correctly at all. Further, some
common HTML things like &lt;text> will work in a browser as proper
text, but might baffle a naive RE.
You'll be happier and more successful with a proper HTML parser.
Python folks often use something like Beautiful Soup to parse HTML and
strip out tags and scripts.
Also, browsers, by design, tolerate malformed HTML. So you will often
find yourself trying to parse HTML which is clearly improper, but
happens to work okay in a browser.
You might be able to parse bad HTML with RE's. All it requires is
patience and hard work. But it's often simpler to use someone else's
parser.
As Avi already pointed out, this is too complex a task for regular expressions. Use get_text from BeautifulSoup or clean_html from nltk to extract text from your HTML.
from bs4 import BeautifulSoup
clean_text = BeautifulSoup(html).get_text()
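or, on NLTK 2.x only (NLTK 3 removed clean_html and directs users to BeautifulSoup's get_text() instead):
import nltk
clean_text = nltk.clean_html(html)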
Another option, thanks to GuillaumeA, is to use pyquery:
from pyquery import PyQuery
clean_text = PyQuery(html).text()
It must be said that the above mentioned html parsers will do the job with varying level of success if the html is not well formed, so you should experiment and see what works best for your input data.
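If installing a third-party package is not an option, the same idea can be sketched with the standard library's html.parser (Python 3). TextCollector and extract_text are hypothetical names used only for this illustration:

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Collect every non-empty run of text found between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)

def extract_text(html):
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return parser.chunks

print(extract_text("<div>Hello</div><h3><div>world</div></h3>"))
# ['Hello', 'world']
print(extract_text("<p>שלום</p><p>你好 &amp; more</p>"))
# ['שלום', '你好 & more']
```

Note that this sketch also collects the contents of script and style elements, which a real scraper would probably want to skip.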
I am not familiar with Python, but the following regular expression can help you.
<\s*(\w+)[^/>]*>
where,
<: starting character
\s*: it may have whitespaces before tag name (ugly but possible).
(\w+): tags can contain letters and numbers (h1). Well, \w also matches '_', but it does not hurt I guess. If curious use ([a-zA-Z0-9]+) instead.
[^/>]*: anything except > and / until closing >
\>: closing >
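Tried from Python, that pattern finds the names of opening tags rather than the text between them, and because [^/>]* excludes slashes it misses any tag whose attributes contain one. A quick sketch (the second sample string is made up for illustration):

```python
import re

OPEN_TAG = re.compile(r"<\s*(\w+)[^/>]*>")

print(OPEN_TAG.findall("<div>Hello</div><h3><div>world</div></h3>"))
# ['div', 'h3', 'div']

# The slashes in the href value stop [^/>]* before it reaches ">".
print(OPEN_TAG.findall('<a href="http://example.com">link</a>'))
# []
```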

Use Python to parse html data which contains "&"

I'm using the python library SGMLParser to parse some html.
I encounter an html tag of the form
<td class="school">Texas A&amp;M</td>
I'd like to read out "Texas A&M". But when handle_data gets called, it gets called with "Texas A", and then, separately, "M" (quotes for clarity).
How do I replace the
&amp;
string with an & before the call, without replacing all special ampersands in the whole string (some of which I may need)?
Thanks!
If you switch from the deprecated SGMLParser to a modern alternative such as lxml (which also handles HTML), this becomes trivial:
>>> from lxml import etree
>>> etree.fromstring('''<td class="school">Texas A&amp;M</td>''').text
'Texas A&M'
SGMLParser has a convert_entityref() method, but instead of the deprecated SGMLParser I would recommend using lxml or Beautiful Soup, which have a better parser API.
Entity references like &amp; are handled by handle_entityref. Check that this method knows how to translate &amp;. The default implementation should call handle_data('&'), but you may have accidentally overridden it.
Also, if possible, consider using the far more advanced lxml instead.
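On Python 3, where sgmllib no longer exists, the standard library's html.parser offers a similar callback API and decodes entities before handle_data is called. A minimal sketch, where SchoolParser and the sample row are assumptions made for illustration:

```python
from html.parser import HTMLParser

class SchoolParser(HTMLParser):
    """Collect the text of every <td class="school"> cell."""

    def __init__(self):
        # convert_charrefs=True turns &amp; into & before handle_data runs.
        super().__init__(convert_charrefs=True)
        self.in_school = False
        self.schools = []

    def handle_starttag(self, tag, attrs):
        if tag == "td" and dict(attrs).get("class") == "school":
            self.in_school = True
            self.schools.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_school = False

    def handle_data(self, data):
        # Append rather than assign, in case the text arrives in pieces.
        if self.in_school:
            self.schools[-1] += data

parser = SchoolParser()
parser.feed('<tr><td class="school">Texas A&amp;M</td>'
            '<td class="city">College Station</td></tr>')
parser.close()
print(parser.schools)  # ['Texas A&M']
```

Accumulating into the last list entry keeps the result intact even with a parser that splits the text around the entity, which is the behaviour described in the question.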

skip over HTML tags in Regular Expression patterns

I'm trying to write a regular expression pattern (in python) for reformatting these template engine files.
Basically the scheme looks like this:
[$$price$$]
{
<h3 class="price">
$12.99
</h3>
}
I'm trying to make it remove any extra tabs\spaces\new lines so it should look like this:
[$$price$$]{<h3 class="price">$12.99</h3>}
I wrote this: (\t|\s)+? which works except it matches within the html tags, so h3 becomes h3class and I am unable to figure out how to make it ignore anything inside the tags.
Using regular expressions to deal with HTML is extremely error-prone; they're simply not the right tool.
Instead, use a HTML/XML-aware library (such as lxml) to build a DOM-style object tree; modify the text segments within the tree in-place, and generate your output again using said library.
Try this:
\r?\n[ \t]*
EDIT: The idea is to remove all newlines (either Unix: "\n", or Windows: "\r\n") plus any horizontal whitespace (TABs or spaces) that immediately follow them.
Alan,
I have to agree with Charles that the safest way is to parse the HTML, then work on the Text nodes only. Sounds overkill but that's the safest.
On the other hand, there is a way in regex to do that as long as you trust that the HTML code is correct (i.e. does not include invalid < and > in the tags as in: <a title="<this is a test>" href="look here">...)
Then, you know that any text has to be between > and < except at the very beginning and end (if you just get a snapshot of the page, otherwise there is the HTML tag minimum.)
So... You still need two regex's: find the text '>[^<]+<', then apply the other regex as you mentioned.
The other way, is to have an or with something like this (not tested!):
'(<[^>]*>)|([\r\n\f ]+)'
This will either find a tag or spaces. When you find a tag, do not replace, if you don't find a tag, replace with an empty string.
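In Python, the "replace only when it is not a tag" step can be written as a replacement function passed to re.sub. This is a sketch under the same assumption as above (no stray < or > inside attribute values); squeeze is a hypothetical name, and \t has been added to the whitespace class so that tabs are removed too:

```python
import re

# Group 1 matches a whole tag, group 2 a run of whitespace outside tags.
TAG_OR_SPACE = re.compile(r"(<[^>]*>)|([\r\n\f\t ]+)")

def squeeze(template):
    """Drop whitespace outside tags; tags are copied through unchanged."""
    return TAG_OR_SPACE.sub(lambda m: m.group(1) or "", template)

template = '[$$price$$]\n{\n\t<h3 class="price">\n\t\t$12.99\n\t</h3>\n}\n'
print(squeeze(template))
# [$$price$$]{<h3 class="price">$12.99</h3>}
```

Because every run of whitespace outside a tag is removed, spaces between words in the text content disappear as well, so this only suits templates like the one in the question.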

Regular expression to extract URL from an HTML link

I’m a newbie in Python. I’m learning regexes, but I need help here.
Here comes the HTML source:
<a href="http://ptop.se">http://www.ptop.se</a>
I’m trying to code a tool that only prints out http://ptop.se. Can you help me please?
If you're only looking for one:
import re
match = re.search(r'href=[\'"]?([^\'" >]+)', s)
if match:
    print(match.group(1))
If you have a long string, and want every instance of the pattern in it:
import re
urls = re.findall(r'href=[\'"]?([^\'" >]+)', s)
print(', '.join(urls))
Where s is the string that you're looking for matches in.
Quick explanation of the regexp bits:
r'...' is a "raw" string. It stops you having to worry about escaping characters quite as much as you normally would. (\ especially -- in a raw string a \ is just a \. In a regular string you'd have to do \\ every time, and that gets old in regexps.)
"href=[\'"]?" says to match "href=", possibly followed by a ' or ". "Possibly" because it's hard to say how horrible the HTML you're looking at is, and the quotes aren't strictly required.
Enclosing the next bit in "()" says to make it a "group", which means to split it out and return it separately to us. It's just a way to say "this is the part of the pattern I'm interested in."
"[^\'" >]+" says to match any characters that aren't ', ", >, or a space. Essentially this is a list of characters that are an end to the URL. It lets us avoid trying to write a regexp that reliably matches a full URL, which can be a bit complicated.
The suggestion in another answer to use BeautifulSoup isn't bad, but it does introduce a higher level of external requirements. Plus it doesn't help you in your stated goal of learning regexps, which I'd assume this specific html-parsing project is just a part of.
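To make the pattern's behaviour concrete, here it is run against double-quoted, single-quoted and unquoted href values. The sample string s is made up for illustration:

```python
import re

HREF = re.compile(r'href=[\'"]?([^\'" >]+)')

s = ('<a href="http://ptop.se">http://www.ptop.se</a> '
     "<a href='/about'>About</a> "
     '<a href=/bare>Bare</a>')

print(HREF.findall(s))
# ['http://ptop.se', '/about', '/bare']
```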
It's pretty easy to do:
from bs4 import BeautifulSoup  # BeautifulSoup 4; the original answer imported the older BeautifulSoup 3 module
soup = BeautifulSoup(html_to_parse, "html.parser")
for tag in soup.find_all('a', href=True):
    print(tag['href'])
Once you've installed BeautifulSoup, anyway.
Don't use regexes, use BeautifulSoup. That, or be so crufty as to spawn it out to, say, w3m/lynx and pull back in what w3m/lynx renders. First is more elegant probably, second just worked a heck of a lot faster on some unoptimized code I wrote a while back.
this should work, although there might be more elegant ways.
import re
url = '<a href="http://ptop.se">http://www.ptop.se</a>'
r = re.compile('(?<=href=").*?(?=")')
r.findall(url)
John Gruber (who wrote Markdown, which is made of regular expressions and is used right here on Stack Overflow) had a go at producing a regular expression that recognises URLs in text:
http://daringfireball.net/2009/11/liberal_regex_for_matching_urls
If you just want to grab the URL (i.e. you’re not really trying to parse the HTML), this might be more lightweight than an HTML parser.
Regexes are fundamentally bad at parsing HTML (see Can you provide some examples of why it is hard to parse XML and HTML with a regex? for why). What you need is an HTML parser. See Can you provide an example of parsing HTML with your favorite parser? for examples using a variety of parsers.
In particular you will want to look at the Python answers: BeautifulSoup, HTMLParser, and lxml.
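As a rough sketch of the parser-based route using only the standard library (Python 3's html.parser; LinkCollector and the sample markup are assumptions made for illustration):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lower-cased; values are unescaped.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

collector = LinkCollector()
collector.feed('<a href="http://ptop.se">http://www.ptop.se</a> '
               "<A HREF='/about?x=1&amp;y=2'>About</A> "
               '<a name="top">no link</a>')
collector.close()
print(collector.links)
# ['http://ptop.se', '/about?x=1&y=2']
```

Unlike the regex versions, this is indifferent to quoting style, attribute order and letter case, and it decodes &amp; inside the URL.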
this regex can help you, you should get the first group by \1 or whatever method you have in your language.
href="([^"]*)
example:
<a href="http://www.amghezi.com">amgheziName</a>
result:
http://www.amghezi.com
There's tonnes of them on regexlib
Yes, there are tons of them on regexlib. That only proves that RE's should not be used to do that. Use SGMLParser or BeautifulSoup or write a parser, but don't use RE's. The ones that seem to work are extremely complicated and still don't cover all cases.
This works pretty well with using optional matches (prints after href=) and gets the link only. Tested on http://pythex.org/
(?:href=['"])([:/.A-z?<_&\s=>0-9;-]+)
Output:
Match 1. /wiki/Main_Page
Match 2. /wiki/Portal:Contents
Match 3. /wiki/Portal:Featured_content
Match 4. /wiki/Portal:Current_events
Match 5. /wiki/Special:Random
Match 6. //donate.wikimedia.org/wiki/Special:FundraiserRedirector?utm_source=donate&utm_medium=sidebar&utm_campaign=C13_en.wikipedia.org&uselang=en
You can use this.
<a[^>]+href=["'](.*?)["']
