Badly named links search and replace - python

The problem I'm facing is badly named links...
There are a few hundred bad links in different files.
So I want to write a script to replace links like
<a href="../../../external.html?link=http://www.twitter.com">
<a href="../../external.html?link=http://www.facebook.com/pages/somepage/">
<a href="../external.html?link=http://www.tumblr.com/">
to direct links like
<a href="http://www.twitter.com>
I know the pattern is ../ repeated one or more times, followed by external.html?link=, which should also be removed.
How would you recommend doing this? awk, sed, maybe Python?
Will I need regex?
Thanks for any opinions...

This could be a place where regular expressions are the correct solution. You are only searching for text in attributes, and the contents are regular, fitting a pattern.
The following python regular expression would locate these links for you:
r'href="((?:\.\./)+external\.html\?link=)([^"]+)"'
The pattern we look for is something inside a href="" chunk of text, where that 'something' starts with one or more instances of ../, followed by external.html?link=, then followed with any text that does not contain a " quote.
The matched text after the equals sign is grouped in group 2 for easy retrieval, group 1 holds the ../../external.html?link= part.
If all you want to do is remove the ../../external.html?link= part altogether (so the links point directly to the endpoint instead of going via the redirect page), leave off the first group and do a simple .sub() on your HTML files:
import re
redirects = re.compile(r'href="(?:\.\./)+external\.html\?link=([^"]+)"')
# ...
redirects.sub(r'href="\1"', somehtmlstring)
Note that this could also match href="..." text appearing in the body of the document (outside HTML tags); this is not an HTML-aware solution. Chances are there is no such body text, though. But if there is, you'll need a full-blown HTML parser like BeautifulSoup or lxml instead.
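Since the bad links are spread over a few hundred files, a minimal sketch for running the substitution over a whole directory tree could look like this (the 'site' directory name and the in-place rewrite are assumptions, adjust to your layout):
import os
import re

# Pattern from the snippet above: strip the ../.../external.html?link= prefix.
redirects = re.compile(r'href="(?:\.\./)+external\.html\?link=([^"]+)"')

for root, dirs, files in os.walk('site'):
    for name in files:
        if not name.endswith('.html'):
            continue
        path = os.path.join(root, name)
        with open(path) as f:
            html = f.read()
        with open(path, 'w') as f:
            f.write(redirects.sub(r'href="\1"', html))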

Use an HTML parser like BeautifulSoup or lxml.html.
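For example, a rough sketch of the parser-based fix with bs4 (the snippet is made up, and html.parser is assumed):
from bs4 import BeautifulSoup

somehtmlstring = '<a href="../../../external.html?link=http://www.twitter.com">Twitter</a>'
soup = BeautifulSoup(somehtmlstring, 'html.parser')
# Rewrite every href that goes through the external.html redirect so it
# points directly at the target URL.
for a in soup.find_all('a', href=True):
    if 'external.html?link=' in a['href']:
        a['href'] = a['href'].split('external.html?link=', 1)[1]
print(soup)
# <a href="http://www.twitter.com">Twitter</a>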

Related

Issues with extracting URLs from text

I am trying to find a regular expression to extract any valid URLs (not only http[s]). Unfortunately, every one I have tried outputs weird things. The best results I achieved were with this regex:
\b((?:[a-z][\w\-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]|\((?:[^\s()<>]|(?:\([^\s()<>]+\)))*\))+(?:\((?:[^\s()<>]|(?:\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))
But I can mark at least the following issues:
http://208.206.41.61/email/email_log.cfm?useremail=3Dtana.jones@enron.com&=refdoc=3D(01-128) is extracted as http://208.206.41.61/email/email_log.cfm?useremail=3Dtana.jones@enron.com&=
http://www.onlinefilefolder.com',AJAXTHRESHOLD should be extracted without AJAXTHRESHOLD
CSS / HTML styling is extracted, for example xmlns:x="urn:schemas-microsoft-com:xslt, ze:12px;color:#666, font-size:12px;color etc
How can I improve this regex to make sure only valid URLs are extracted? I am not only extracting it from the HTML, but also from a plain text. Therefore, using only beautifulsoup is impossible for my use case.
No regex is perfect, but this one might help you:
(?:(?:https?|ftp|file):\/\/|www\.|ftp\.)(?:\([-A-Z0-9+&@#\/%=~_|$?!:,.]*\)|[-A-Z0-9+&@#\/%=~_|$?!:,.])*(?:\([-A-Z0-9+&@#\/%=~_|$?!:,.]*\)|[A-Z0-9+&@#\/%=~_|$])
Flag to enable: insensitive, global, multiline (igm)
Source: http://www.regexguru.com/2008/11/detecting-urls-in-a-block-of-text/
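In Python, the insensitive flag maps onto re.IGNORECASE (MULTILINE changes nothing for a pattern without anchors, but is included to mirror the flags above); a quick sketch on made-up text:
import re

url_pattern = re.compile(
    r'(?:(?:https?|ftp|file):\/\/|www\.|ftp\.)'
    r'(?:\([-A-Z0-9+&@#\/%=~_|$?!:,.]*\)|[-A-Z0-9+&@#\/%=~_|$?!:,.])*'
    r'(?:\([-A-Z0-9+&@#\/%=~_|$?!:,.]*\)|[A-Z0-9+&@#\/%=~_|$])',
    re.IGNORECASE | re.MULTILINE)

text = "Visit www.example.com or ftp://files.example.org/pub for details."
print(url_pattern.findall(text))
# ['www.example.com', 'ftp://files.example.org/pub']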

Regular Expression - HTML

I am kind of new to regular expressions, but the one I made myself doesn't work. It is supposed to give me data from a website's HTML.
I basically want to get this out of the HTML, and all of the multiple occurrences. I have the page URL as a string, btw.
Co-Op
And what I've done for my regexp is:
<a\bhref="http://store.steampowered.com/search/?category2=2"\bclass="name"*>(.*?)</a>\g
You should never parse HTML/XML, or any other language that allows arbitrary nesting, using regular expressions.
A nice thing with HTML however, is that it can be converted to XML and XML has a nice toolkit for parsing:
echo '<a href="http://store.steampowered.com/search/?category2=2" class="name">Co-Op</a>' | tidy -asxhtml -numeric 2> /dev/null | xmllint --html --xpath 'normalize-space(//a[@class="name" and @href="http://store.steampowered.com/search/?category2=2"])' - 2>/dev/null
With query:
normalize-space(//a[@class="name" and @href="http://store.steampowered.com/search/?category2=2"])
// means any element (regardless of its depth), a means the <a> tag, and we furthermore specify the constraints that class="name" and href=(the link). We then return the normalize-space'd content between the matching <a> and </a> tags.
In Python you can use:
import urllib2
from bs4 import BeautifulSoup
page = urllib2.urlopen("http://store.steampowered.com/app/24860/").read()
soup = BeautifulSoup(page)
print soup.find_all('a',attrs={'class':'name','href':'http://store.steampowered.com/search/?category2=2'})
Comment on your regex:
the problem is that it contains tokens like ? that are interpreted as regex-directives rather than characters. You need to escape them. It should probably read:
<a\s+href="http://store\.steampowered\.com/search/\?category2=2"\s+class="name"\S*>(.*?)</a>\g
I also replaced \b with \s, \s means space characters like space, tab, new line. Although the regex is quite fragile: if one ever decides to swap href and class, the program has a problem. For most of these problems, there are indeed solutions, but you better use an XML analysis tool.
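As a quick sanity check of the corrected pattern (minus the trailing \g, which has no meaning in Python's re syntax) against a made-up snippet:
import re

pattern = re.compile(
    r'<a\s+href="http://store\.steampowered\.com/search/\?category2=2"'
    r'\s+class="name"\S*>(.*?)</a>')

snippet = '<a href="http://store.steampowered.com/search/?category2=2" class="name">Co-Op</a>'
print(pattern.findall(snippet))
# ['Co-Op']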

How can I make a regular expression to extract all anchor tags or links from a string?

I've seen other questions which will parse either all plain links, or all anchor tags from a string, but nothing that does both.
Ideally, the regular expression will be able to parse a string like this (I'm using Python):
>>> import re
>>> content = '''<a href="http://www.google.com">http://www.google.com</a> Some other text.
... And even more text! http://stackoverflow.com'''
>>> links = re.findall('some-regular-expression', content)
>>> print links
[u'http://www.google.com', u'http://stackoverflow.com']
Is it possible to produce a regular expression which would not result in duplicate links being returned? Is there a better way to do this?
No matter what you do, it's going to be messy. Nevertheless, a 90% solution might resemble:
r'<a\s[^>]*>([^<]*)</a>|\b(\w+://[^<>\'"\t\r\n\xc2\xa0]*[^<>\'"\t\r\n\xc2\xa0 .,()])'
Since that pattern has two groups, it will return a list of 2-tuples; to join them, you could use a list comprehension or even a map:
map(''.join, re.findall(pattern, content))
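For instance, on a made-up snippet that mixes an anchor and a plain-text link:
import re

pattern = r'<a\s[^>]*>([^<]*)</a>|\b(\w+://[^<>\'"\t\r\n\xc2\xa0]*[^<>\'"\t\r\n\xc2\xa0 .,()])'
content = '<a href="http://www.google.com">Google</a> or see http://stackoverflow.com<br>'
# findall returns 2-tuples; joining each tuple flattens them into plain strings.
print(list(map(''.join, re.findall(pattern, content))))
# ['Google', 'http://stackoverflow.com']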
If you want the href attribute of the anchor instead of the link text, the pattern gets even messier:
r'<a\s[^>]*href=[\'"]([^"\']*)[\'"][^>]*>[^<]*</a>|\b(\w+://[^<>\'"\t\r\n\xc2\xa0]*[^<>\'"\t\r\n\xc2\xa0 .,()])'
Alternatively, you can just let the second half of the pattern pick up the href attribute, which also alleviates the need for the string join:
r'\b\w+://[^<>\'"\t\r\n\xc2\xa0]*[^<>\'"\t\r\n\xc2\xa0 .,()]'
Once you have this much in place, you can replace any found links with something that doesn't look like a link, search for '://', and update the pattern to collect what it missed. You may also have to clean up false positives, particularly garbage at the end. (This pattern had to find links that included spaces, in plain text, so it's particularly prone to excess greediness.)
Warning: Do not rely on this for future user input, particularly when security is on the line. It is best used only for manually collecting links from existing data.
Usually you should never parse HTML with regular expressions, since HTML isn't a regular language. Here it seems you only want to get all the http links, whether they are in an <a> element or in plain text. How about getting them all and then removing the duplicates?
Try something like
set(re.findall("(http:\/\/.*?)[\"' <]", content))
and see if it serves your purpose.
Writing a regex pattern that matches all valid URLs is tricky business.
If all you're looking for is to detect simple http/https URLs within an arbitrary string, I could offer you this solution:
>>> import re
>>> content = '<a href="http://www.google.com">http://www.google.com</a> Some other text. And even more text! http://stackoverflow.com'
>>> re.findall(r"https?://[\w\-.~/?:#\[\]#!$&'()*+,;=]+", content)
['http://www.google.com', 'http://www.google.com', 'http://stackoverflow.com']
That looks for strings that start with http:// or https:// followed by one or more valid chars.
To avoid duplicate entries, use set():
>>> list(set(re.findall(r"https?://[\w\-.~/?:#\[\]@!$&'()*+,;=]+", content)))
['http://www.google.com', 'http://stackoverflow.com']
You should not use regular expressions to extract things from HTML. You should use an HTML parser.
If you also want to extract things from the text of the page then you should do that separately.
Here's how you would do it with lxml:
# -*- coding: utf8 -*-
import lxml.html as lh
import re
html = """
is.gd/testhttp://www.google.com Some other text.
And even more text! http://stackoverflow.com
here's a url bit.ly/test
"""
tree = lh.fromstring(html)
urls = set([])
for a in tree.xpath('//a'):
    urls.add(a.text)
for text in tree.xpath('//text()'):
    for url in re.findall(r'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))', text):
        urls.add(url[0])
print urls
Result:
set(['http://www.google.com', 'bit.ly/test', 'http://stackoverflow.com', 'is.gd/test'])
URL matching regex from here: http://daringfireball.net/2010/07/improved_regex_for_matching_urls
No, it will not be able to parse a string like this. Regexes are only capable of simple matching, and you can't parse a grammar as complicated as HTML with just one or two regexes.

strip only html anchor tags

I have the following code that strips all tags. Now I want to strip only anchor tags.
x = re.compile(r'<[^<]*?/?>')
How do I modify it so that only anchor tags are stripped?
following code that strips all tags.
Not really. <div title="a>b"> is valid HTML and gets mangled. <div title="<" onmouseover="script()" class="<">"> is invalid HTML but the kind of thing you will often find on real web pages. Your regexp leaves an active tag with dangerous scripting in it.
You can't do an HTML-processing task like tag-stripping with regex, unless your possible input set is heavily restricted. Better to use a real HTML parser and walk across the resulting document removing unwanted elements as you go.
eg. with BeautifulSoup:
def replaceWithContents(element):
    ix = element.parent.contents.index(element)
    for child in reversed(element.contents):
        element.parent.insert(ix, child)
    element.extract()

doc = BeautifulSoup(html)  # maybe fromEncoding='utf-8'
for link in doc.findAll('a'):
    replaceWithContents(link)
str(doc)
x = re.compile(r'<[aA]\b[^<]*?/?>')
This will match the 'a' or 'A' followed by a word boundary. Note that it won't clean out the closing tag.
x = re.compile(r'</?[aA]\b[^<]*?/?>')
will remove the closing tag as well.
EDIT:
Actually, it feels more reliable to switch the [^<] to [^>], like so.
x = re.compile(r'</?[aA]\b[^>]*?/?>')
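As a quick check on a made-up snippet; the \b keeps the pattern from eating tags such as <abbr> that merely start with an 'a':
import re

x = re.compile(r'</?[aA]\b[^>]*?/?>')
html = '<p>See the <abbr title="docs">manual</abbr> or <a href="http://example.com/">this page</a>.</p>'
print(x.sub('', html))
# <p>See the <abbr title="docs">manual</abbr> or this page.</p>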
I'm not sure if this Python is correct (I'm a PHP guy but am just starting to learn python in my own time).
re.sub('<[aA][^>]*>([^<]+)</[aA]>', r'\1', '<html><head> .... </body></html>')
This won't remove all anchor tags in one shot, so you may have to loop over the html string. It matches the anchor tags and replaces the match with the contents of the tags. So ...
<a href="...">homepage</a> -> homepage
Might not be the most efficient on a large body of text but works.

skip over HTML tags in Regular Expression patterns

I'm trying to write a regular expression pattern (in python) for reformatting these template engine files.
Basically the scheme looks like this:
[$$price$$]
{
<h3 class="price">
$12.99
</h3>
}
I'm trying to make it remove any extra tabs\spaces\new lines so it should look like this:
[$$price$$]{<h3 class="price">$12.99</h3>}
I wrote this: (\t|\s)+? which works except it matches within the html tags, so h3 becomes h3class and I am unable to figure out how to make it ignore anything inside the tags.
Using regular expressions to deal with HTML is extremely error-prone; they're simply not the right tool.
Instead, use an HTML/XML-aware library (such as lxml) to build a DOM-style object tree; modify the text segments within the tree in-place, and generate your output again using said library.
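A rough sketch of that approach with lxml.html, assuming you have isolated the HTML fragment between the braces:
import re
import lxml.html as lh

fragment = '<h3 class="price">\n    $12.99\n</h3>'
tree = lh.fromstring(fragment)
# Collapse and trim the whitespace in every text segment, in place.
for node in tree.iter():
    if node.text:
        node.text = re.sub(r'\s+', ' ', node.text).strip()
    if node.tail:
        node.tail = re.sub(r'\s+', ' ', node.tail).strip()
print(lh.tostring(tree, encoding='unicode'))
# <h3 class="price">$12.99</h3>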
Try this:
\r?\n[ \t]*
EDIT: The idea is to remove all newlines (either Unix: "\n", or Windows: "\r\n") plus any horizontal whitespace (TABs or spaces) that immediately follow them.
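Applied to the example from the question:
import re

template = '[$$price$$]\n{\n<h3 class="price">\n\t$12.99\n</h3>\n}'
# Remove each newline together with the indentation that follows it.
print(re.sub(r'\r?\n[ \t]*', '', template))
# [$$price$$]{<h3 class="price">$12.99</h3>}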
Alan,
I have to agree with Charles that the safest way is to parse the HTML, then work on the Text nodes only. Sounds overkill but that's the safest.
On the other hand, there is a way in regex to do that as long as you trust that the HTML code is correct (i.e. does not include invalid < and > in the tags as in: <a title="<this is a test>" href="look here">...)
Then, you know that any text has to be between > and < except at the very beginning and end (if you just get a snapshot of the page, otherwise there is the HTML tag minimum.)
So... You still need two regex's: find the text '>[^<]+<', then apply the other regex as you mentioned.
The other way is to use an alternation, with something like this (not tested!):
'(<[^>]*>)|([\r\n\f ]+)'
This will find either a tag or a run of whitespace. When you find a tag, do not replace it; when you find whitespace, replace it with an empty string.
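As a rough sketch on the template from the question, the replacement can be a function that keeps tag matches and drops whitespace matches:
import re

pattern = re.compile('(<[^>]*>)|([\r\n\f ]+)')

def collapse(match):
    # Group 1 is a tag: keep it untouched. Otherwise it was whitespace: drop it.
    return match.group(1) or ''

template = '[$$price$$]\n{\n<h3 class="price">\n    $12.99\n</h3>\n}'
print(pattern.sub(collapse, template))
# [$$price$$]{<h3 class="price">$12.99</h3>}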
