Regular Expression - HTML

Regular Expression - HTML - python

I am kind of new to regular expressions, but the one i made myself doesn't work. It is supposed to give me data from a websites html.
I basically want to get this out of html, and all of the multiple ones. I have the page url as a string btw.
Co-Op
And what i've done for my regexp is:
<a\bhref="http://store.steampowered.com/search/?category2=2"\bclass="name"*>(.*?)</a>\g

You should never parse HTML/XML or any other language that allows cascading using regular expressions.
A nice thing with HTML however, is that it can be converted to XML and XML has a nice toolkit for parsing:
echo 'Co-Op' | tidy -asxhtml -numeric 2> /dev/null | xmllint --html --xpath 'normalize-space(//a[#class="name" and #href="http://store.steampowered.com/search/?category2=2"])' - 2>/dev/null
With query:
normalize-space(//a[#class="name" and #href="http://store.steampowered.com/search/?category2=2"])
// means any tag (regardless of it's depth), a means the a tag, and we furthermore specify the constraints that class=name and href=(the link). And then we returned the normalize-space content between the such tag <a> and </a>.
In Python you can use:
import urllib2
from bs4 import BeautifulSoup
page = urllib2.urlopen("http://store.steampowered.com/app/24860/").read()
soup = BeautifulSoup(page)
print soup.find_all('a',attrs={'class':'name','href':'http://store.steampowered.com/search/?category2=2'})
Comment on your regex:
the problem is that it contains tokens like ? that are interpreted as regex-directives rather than characters. You need to escape them. It should probably read:
<a\s+href="http://store\.steampowered\.com/search/\?category2=2"\s+class="name"\S*>(.*?)</a>\g
I also replaced \b with \s, \s means space characters like space, tab, new line. Although the regex is quite fragile: if one ever decides to swap href and class, the program has a problem. For most of these problems, there are indeed solutions, but you better use an XML analysis tool.

Related

Bad named links search and replace

The problem i'm facing is badly named links...
There are few hundred bad links in different files.
So I write bash to replace links
<a href="../../../external.html?link=http://www.twitter.com"><a href="../../external.html?link=http://www.facebook.com/pages/somepage/">
<a href="../external.html?link=http://www.tumblr.com/">
to direct links like
<a href="http://www.twitter.com>
I know we have pattern ../ repeating one or more times. Also external.html?link which also should be removed.
How would recommend to do this? awk, sed, maybe python??
Will i need regex?
Thanks for opinions...

This could be a place where regular expressions are the correct solution. You are only searching for text in attributes, and the contents are regular, fitting a pattern.
The following python regular expression would locate these links for you:
r'href="((?:\.\./)+external\.html\?link=)([^"]+)"'
The pattern we look for is something inside a href="" chunk of text, where that 'something' starts with one or more instances of ../, followed by external.html?link=, then followed with any text that does not contain a " quote.
The matched text after the equals sign is grouped in group 2 for easy retrieval, group 1 holds the ../../external.html?link= part.
If all you want to do is remove the ../../external.html?link= part altogether (so the links point directly to the endpoint instead of going via the redirect page), leave off the first group and do a simple .sub() on your HTML files:
import re
redirects = re.compile(r'href="(?:\.\./)+external\.html\?link=([^"]+)"')
# ...
redirects.sub(r'href="\1"', somehtmlstring)
Note that this could also match any body text (so outside HTML tags), this is not a HTML-aware solution. Chances are there is no such body text though. But if there is, you'll need a full-blown HTML parser like BeautifulSoup or lxml instead.

Use a HTML parser like BeautifulSoup or lxml.html.

need to selectively escape html entities (&)

I'm scraping a html page, then using xml.dom.minidom.parseString() to create a dom object.
however, the html page has a '&'. I can use cgi.escape to convert this into & but it also converts all my html <> tags into <> which makes parseString() unhappy.
how do i go about this? i would rather not just hack it and straight replace the "&"s
thanks

For scraping, try to use a library that can handle such html "tag soup", like lxml, which has a html parser (as well as a dedicated html package in lxml.html), or BeautifulSoup (you will also find that these libraries also contain other stuff that makes scraping/working with html easier, aside from being able to handle ill-formed documents: getting information out of forms, making hyperlinks absolute, using css selectors...)

i would rather not just hack it and
straight replace the "&"s
Er, why? That's what cgi.escape is doing - effectively just a search and replace operation for certain characters that have to be escaped.
If you only want to replace a single character, just replace the single character:
yourstring.replace('&', '&')
Don't beat around the bush.

If you want to make sure that you don't accidentally re-escape an already escaped & (i. e. not transform & into &amp; or ß into &szlig;), you could
import re
newstring = re.sub(r"&(?![A-Za-z])", "&", oldstring)
This will leave &s alone when they are followed by a letter.

You shouldn't use an XML parser to parse data that isn't XML. Find an HTML parser instead, you'll be happier in the long run. The standard library has a few (HTMLParser and htmllib), and BeautifulSoup is a well-loved third-party package.

How to find/replace text in html while preserving html tags/structure

I use regexps to transform text as I want, but I want to preserve the HTML tags.
e.g. if I want to replace "stack overflow" with "stack underflow", this should work as
expected: if the input is stack <sometag>overflow</sometag>, I must obtain stack <sometag>underflow</sometag> (i.e. the string substitution is done, but the
tags are still there...

Use a DOM library, not regular expressions, when dealing with manipulating HTML:
lxml: a parser, document, and HTML serializer. Also can use BeautifulSoup and html5lib for parsing.
BeautifulSoup: a parser, document, and HTML serializer.
html5lib: a parser. It has a serializer.
ElementTree: a document object, and XML serializer
cElementTree: a document object implemented as a C extension.
HTMLParser: a parser.
Genshi: includes a parser, document, and HTML serializer.
xml.dom.minidom: a document model built into the standard library, which html5lib can parse to.
Stolen from http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/.
Out of these I would recommend lxml, html5lib, and BeautifulSoup.

Beautiful Soup or HTMLParser is your answer.

Note that arbitrary replacements can't be done unambiguously. Consider the following examples:
1)
HTML:
A<tag>B</tag>
Pattern -> replacement:
AB -> AXB
Possible results:
AX<tag>B</tag>
A<tag>XB</tag>
2)
HTML:
A<tag>A</tag>A
Pattern -> replacement:
A+ -> WXYZ
Possible results:
W<tag />XYZ
W<tag>X</tag>YZ
W<tag>XY</tag>Z
W<tag>XYZ</tag>
WX<tag />YZ
WX<tag>Y</tag>Z
WX<tag>YZ</tag>
WXY<tag />Z
WXY<tag>Z</tag>
WXYZ
What kind of algorithms work for your case depends highly on the nature of possible search patterns and desired rules for handling ambiguity.

Use html parser such as provided by lxml or BeautifulSoup. Another option is to use XSLT transformations (XSLT in Jython).

I don't think that the DOM / HTML parser library recommendations posted so far address the specific problem in the given example: overflow should replaced with underflow only when preceded by stack in the rendered document, whether or not there are tags between them. Such a library is a necessary part the solution, though.
Assuming that tags never appear in the middle of words, one solution would be to
process the DOM, tokenize all text nodes and insert a unique identifier
at the beginning of each token (e.g. word)
render the document as plain text
search and replace the plain text with regexes which use groups to match, preserve and
mark unique identifiers at the beginning of each token
extract all tokens with marked unique identifiers from the plain text
process the DOM by removing unique identifiers and replacing tokens matching
marked unique identifiers with corresponding changed tokens
render the processed DOM back to HTML
Example:
In 1. the HTML DOM,
stack <sometag>overflow</sometag>
becomes the DOM
#1;stack <sometag>#2;overflow</sometag>
and in 2. the plain text is produced:
#1;stack #2;overflow
The regex needed in 3. is #(\d+);stack\s+#(\d+);overflow\b and the replacement #\1;stack %\2;underflow. Note that only the second word is marked by changing # to % in the unique identifier, since the first word isn't altered.
In 4., the word underflow with the unique identifier numbered 2 is extracted from the resulting plain text since it was marked by changing the # to a %.
In 5., all #(\d+); identifiers are removed from text nodes of the DOM while looking up their numbers among extracted words. The number 1 is not found, so #1;stack is replaced with simply stack. The number 2 is found with the changed word underflow, so #2;overflow is replaced by underflow.
Finally in 6. the DOM is rendered back to the HTML document `stack underflow.

Fun stuff to try. It sorta works. My friends like it when I attach this script to a textarea and let them "translate" things. I guess you could use it for anything really. Meh. Check the code over a few times if you're going to use it, it works but I'm new to all this. I think it's been 2 or three weeks since I started studying the php.
<?php
$html = ('<div style="border: groove 2px;"><p>Dear so and so, after reviewing your application I. . .</p><p>More of the same...</p><p>sincerely,</p><p>Important Dude</p></div>');
$oldWords = array('important', 'sincerely');
$newWords = array('arrogant', 'ya sure');
// function for oldWords
function regex_oldWords_word_list(&$item1, $key)
{
$item1 = "/>([^<>]+)?\b$item1(tionally|istic|tion|ance|ence|less|ally|able|ness|ing|ity|ful|ant|est|ist|ic|al|ed|er|et|ly|y|s|d|'s|'d|'ve|'ll)?\b([^<>]+)?/";
}
// function for newWords
function format_newWords_results(&$item1, $key)
{
$item1 = ">$1<span style=\"color: red;\"><em> $item1$2</em></span>$3";
}
// apply regex to oldWords
array_walk($oldWords, 'regex_oldWords_word_list');
// apply formatting to newWords
array_walk($newWords, 'format_newWords_results');
//HTML is not always as perfect as we want it
$poo = array('/ /', '/>([a-zA-Z\']+)/', '/’/', '/;([a-zA-Z\']+)/', '/"([a-zA-Z\']+)/', '/([a-zA-Z\']+)</', '/\.\.+/', '/\. \.+/');
$unpoo = array(' ', '> $1', '\'', '; $1', '" $1', '$1 <', '. crap taco.', '. crap taco with cheese.');
//and maybe things will go back to normal sort of
$repoo = array('/> /', '/; /', '/" /', '/ </');
$muck = array('> ', ';', '"',' <');
//before
echo ($html);
//I don't know what was happening on the free host but I had to keep stripping slashes
//This is where the work is done anyway.
$html = stripslashes(preg_replace($repoo , $muck , (ucwords(preg_replace($oldWords , $newWords , (preg_replace($poo , $unpoo , (stripslashes(strtolower(stripslashes($html)))))))))));
//after
echo ('<hr/> ' . $html);
//now if only there were a way to keep it out of the area between
//<style>here</style> and <script>here</script> and tell it that english isn't math.
?>

skip over HTML tags in Regular Expression patterns

I'm trying to write a regular expression pattern (in python) for reformatting these template engine files.
Basically the scheme looks like this:
[$$price$$]
{
<h3 class="price">
$12.99
</h3>
}
I'm trying to make it remove any extra tabs\spaces\new lines so it should look like this:
[$$price$$]{<h3 class="price">$12.99</h3>}
I wrote this: (\t|\s)+? which works except it matches within the html tags, so h3 becomes h3class and I am unable to figure out how to make it ignore anything inside the tags.

Using regular expressions to deal with HTML is extremely error-prone; they're simply not the right tool.
Instead, use a HTML/XML-aware library (such as lxml) to build a DOM-style object tree; modify the text segments within the tree in-place, and generate your output again using said library.

Try this:
\r?\n[ \t]*
EDIT: The idea is to remove all newlines (either Unix: "\n", or Windows: "\r\n") plus any horizontal whitespace (TABs or spaces) that immediately follow them.

Alan,
I have to agree with Charles that the safest way is to parse the HTML, then work on the Text nodes only. Sounds overkill but that's the safest.
On the other hand, there is a way in regex to do that as long as you trust that the HTML code is correct (i.e. does not include invalid < and > in the tags as in: <a title="<this is a test>" href="look here">...)
Then, you know that any text has to be between > and < except at the very beginning and end (if you just get a snapshot of the page, otherwise there is the HTML tag minimum.)
So... You still need two regex's: find the text '>[^<]+<', then apply the other regex as you mentioned.
The other way, is to have an or with something like this (not tested!):
'(<[^>]*>)|([\r\n\f ]+)'
This will either find a tag or spaces. When you find a tag, do not replace, if you don't find a tag, replace with an empty string.

Regular expression to extract URL from an HTML link

I’m a newbie in Python. I’m learning regexes, but I need help here.
Here comes the HTML source:
http://www.ptop.se
I’m trying to code a tool that only prints out http://ptop.se. Can you help me please?

If you're only looking for one:
import re
match = re.search(r'href=[\'"]?([^\'" >]+)', s)
if match:
print(match.group(1))
If you have a long string, and want every instance of the pattern in it:
import re
urls = re.findall(r'href=[\'"]?([^\'" >]+)', s)
print(', '.join(urls))
Where s is the string that you're looking for matches in.
Quick explanation of the regexp bits:
r'...' is a "raw" string. It stops you having to worry about escaping characters quite as much as you normally would. (\ especially -- in a raw string a \ is just a \. In a regular string you'd have to do \\ every time, and that gets old in regexps.)
"href=[\'"]?" says to match "href=", possibly followed by a ' or ". "Possibly" because it's hard to say how horrible the HTML you're looking at is, and the quotes aren't strictly required.
Enclosing the next bit in "()" says to make it a "group", which means to split it out and return it separately to us. It's just a way to say "this is the part of the pattern I'm interested in."
"[^\'" >]+" says to match any characters that aren't ', ", >, or a space. Essentially this is a list of characters that are an end to the URL. It lets us avoid trying to write a regexp that reliably matches a full URL, which can be a bit complicated.
The suggestion in another answer to use BeautifulSoup isn't bad, but it does introduce a higher level of external requirements. Plus it doesn't help you in your stated goal of learning regexps, which I'd assume this specific html-parsing project is just a part of.
It's pretty easy to do:
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(html_to_parse)
for tag in soup.findAll('a', href=True):
print(tag['href'])
Once you've installed BeautifulSoup, anyway.

Don't use regexes, use BeautifulSoup. That, or be so crufty as to spawn it out to, say, w3m/lynx and pull back in what w3m/lynx renders. First is more elegant probably, second just worked a heck of a lot faster on some unoptimized code I wrote a while back.

this should work, although there might be more elegant ways.
import re
url='http://www.ptop.se'
r = re.compile('(?<=href=").*?(?=")')
r.findall(url)

John Gruber (who wrote Markdown, which is made of regular expressions and is used right here on Stack Overflow) had a go at producing a regular expression that recognises URLs in text:
http://daringfireball.net/2009/11/liberal_regex_for_matching_urls
If you just want to grab the URL (i.e. you’re not really trying to parse the HTML), this might be more lightweight than an HTML parser.

Regexes are fundamentally bad at parsing HTML (see Can you provide some examples of why it is hard to parse XML and HTML with a regex? for why). What you need is an HTML parser. See Can you provide an example of parsing HTML with your favorite parser? for examples using a variety of parsers.
In particular you will want to look at the Python answers: BeautifulSoup, HTMLParser, and lxml.

this regex can help you, you should get the first group by \1 or whatever method you have in your language.
href="([^"]*)
example:
amgheziName
result:
http://www.amghezi.com

There's tonnes of them on regexlib

Yes, there are tons of them on regexlib. That only proves that RE's should not be used to do that. Use SGMLParser or BeautifulSoup or write a parser - but don't use RE's. The ones that seems to work are extremely compliated and still don't cover all cases.

This works pretty well with using optional matches (prints after href=) and gets the link only. Tested on http://pythex.org/
(?:href=['"])([:/.A-z?<_&\s=>0-9;-]+)
Oputput:
Match 1. /wiki/Main_Page
Match 2. /wiki/Portal:Contents
Match 3. /wiki/Portal:Featured_content
Match 4. /wiki/Portal:Current_events
Match 5. /wiki/Special:Random
Match 6. //donate.wikimedia.org/wiki/Special:FundraiserRedirector?utm_source=donate&utm_medium=sidebar&utm_campaign=C13_en.wikipedia.org&uselang=en

You can use this.
<a[^>]+href=["'](.*?)["']

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.