Python: store many regex matches in tuple? - python

I'm trying to make a simple Python-based HTML parser using regular expressions. My problem is trying to get my regex search query to find all the possible matches, then store them in a tuple.
Let's say I have a page with the following stored in the variable HTMLtext:
<ul>
<li class="active"><b>Back to the index</b></li>
<li><b>About Me!</b></li>
<li><b>Audio Production</b></li>
<li><b>Gallery</b></li>
<li><b>Misc</b></li>
<li><b>Shoot me an email</b></li>
</ul>
I want to perform a regex search on this text and return a tuple containing the last URL directory of each link. So, I'd like to return something like this:
pages = ["home", "about", "music", "photos", "stuff", "contact"]
So far, I'm able to use regex to search for one result:
pages = [re.compile('<a href="/blog/(.*)">').search(HTMLtext).group(1)]
Running this expression makespages = ['home'].
How can I get the regex search to continue for the whole text, appending the matched text to this tuple?
(Note: I know I probably should NOT be using regex to parse HTML. But I want to know how to do this anyway.)

Your pattern won’t work on all inputs, including yours. The .* is going to be too greedy (technically, it finds a maximal match), causing it to be the first href and the last corresponding close. The two simplest ways to fix this is to use either a minimal match, or else a negates character class.
# minimal match approach
pages = re.findall(r'<a\s+href="/blog/(.+?)">',
full_html_text, re.I + re.S)
# negated charclass approach
pages = re.findall(r'<a\s+href="/blog/([^"]+)">',
full_html_text, re.I)
Obligatory Warning
For simple and reasonably well-constrained text, regexes are just fine; after all, that’s why we use regex search-and-replace in our text editors when editing HTML! However, it gets more and more complicated the less you know about the input, such as
if there’s some other field intervening between the <a and the href, like <a title="foo" href="bar">
casing issues like <A HREF='foo'>
whitespace issues
alternate quotes like href='/foo/bar' instead of href="/foo/bar"
embedded HTML comments
That’s not an exclusive list of concerns; there are others. And so, using regexes on HTML thus is possible but whether it’s expedient depends on too many other factors to judge.
However, from the little example you’ve shown, it looks perfectly ok for your own case. You just have to spiff up your pattern and call the right method.

Use findall function of re module:
pages = re.findall('<a href="/blog/([^"]*)">',HTMLtext)
print(pages)
Output:
['home', 'about', 'music', 'photos', 'stuff', 'contact']

Use findall instead of search:
>>> pages = re.compile('<a href="/blog/(.*)">').findall(HTMLtext)
>>> pages
['home', 'about', 'music', 'photos', 'stuff', 'contact']

The re.findall() function and the re.finditer() function are used to find multiple matches.

To find all results use findall(). Also you need to compile the re only once and then you can reuse it.
href_re = re.compile('<a href="/blog/(.*)">') # Compile the regexp once
pages = href_re.findall(HTMLtext) # Find all matches - ["home", "about",

Related

Get only URL from string - Python

I am scraping a page with Python and BeautifulSoup library.
I have to get the URL only from this string. This actually is in href attribute of the a tag. I have scraped it but cannot seem to find a way to extract the URL from this
javascript:void%20window.open('/Sheraton-Tucson-Hotel-177/tnc/150/24795/en','TC_POPUP','width=490,height=405,screenX=300,screenY=250,top=250,left=300,scrollbars=yes,resizable=no');
You can write a straightforward regex to extract the URL.
>>> import re
>>> href = "javascript:void%20window.open('/Sheraton-Tucson-Hotel-177/tnc/150/24795/en','TC_POPUP','width=490,height=405,screenX=300,screenY=250,top=250,left=300,scrollbars=yes,resizable=no');"
>>> re.findall(r"'(.*?)'", href)
['/Sheraton-Tucson-Hotel-177/tnc/150/24795/en', 'TC_POPUP', 'width=490,height=405,screenX=300,screenY=250,top=250,left=300,scrollbars=yes,resizable=no']
>>> _[0]
'/Sheraton-Tucson-Hotel-177/tnc/150/24795/en'
The regex in question here is
'(.*?)'
Which reads "find a single-quote, followed by whatever (and capture the whatever), followed by another single quote, and do so non-greedily because of the ? operator". This extracts the arguments of window.open; then, just pick the first one to get the URL.
You shouldn't have any nested ' in your href, since those should be escaped to %27. If you do, though, this will not work, and you may need a solution that doesn't use regexes.
I did it that way.
terms = javascript:void%20window.open('/Sheraton-Tucson-Hotel-177/tnc/150/24795/en','TC_POPUP','width=490,height=405,screenX=300,screenY=250,top=250,left=300,scrollbars=yes,resizable=no');
terms.split("('")[1].split("','")[0]
outputs
/Sheraton-Tucson-Hotel-177/tnc/150/24795/en
Instead of a regex, you could just partition it twice on something, (eg: '):
s.partition("'")[2].partition("'")[0]
# /Sheraton-Tucson-Hotel-177/tnc/150/24795/en
Here's a quick and ugly answer
href.split("'")[1]

Find by Text and Replace in HTML BeautifulSoup

I'm trying to mark up an HTML file (literally wrapping strings in "mark" tags) using python and BeautifulSoup. The problem is basically as follows...
Say I have my original html document:
test = "<h1>oh hey</h1><div>here is some <b>SILLY</b> text</div>"
I want to do a case-insensitive search for a string in this document (ignoring HTML) and wrap it in "mark" tags. So let's say I want to find "here is some silly text" in the html (ignoring the bold tags). I'd like to take the matching html and wrap it in "mark" tags.
For example, if I want to search for "here is some silly text" in test, the desired output is:
"<h1>oh hey</h1><div><mark>here is some <b>SILLY</b> text</mark></div>"
Any ideas? If it's more appropriate to use lxml or regular expressions, I'm open to those solutions as well.
>>> soup = bs4.BeautifulSoup(test)
>>> matches = soup.find_all(lambda x: x.text.lower() == 'here is some silly text')
>>> for match in matches:
... match.wrap(soup.new_tag('mark'))
>>> soup
<html><body><h1>oh hey</h1><mark><div>here is some <b>SILLY</b> text</div></mark></body></html>
The reason I had to pass a function as the name to find_all that compares x.text.lower(), instead of just using the text argument with a function that compares x.lower(), is that the latter will not find the content in some cases that you apparently want.
The wrap function may not work this way in some cases. If it doesn't, you will have to instead enumerate(matches), and set matches[i] = match.wrap(soup.new_tag('mark')). (You can't use replace_with to replace a tag with a new tag that references itself.)
Also note that if your intended use case allows any non-ASCII string to ever match 'here is some silly text' (or if you want to broaden the code to handle non-ASCII search strings), the code above using lower() may be incorrect. You may want to call str.casefold() and/or locale.strxfrm(s) and/or use locale.strcoll(s, t) instead of using ==, but you'll have to understand what you want and how to get it to pick the right answer.

Bad named links search and replace

The problem i'm facing is badly named links...
There are few hundred bad links in different files.
So I write bash to replace links
<a href="../../../external.html?link=http://www.twitter.com"><a href="../../external.html?link=http://www.facebook.com/pages/somepage/">
<a href="../external.html?link=http://www.tumblr.com/">
to direct links like
<a href="http://www.twitter.com>
I know we have pattern ../ repeating one or more times. Also external.html?link which also should be removed.
How would recommend to do this? awk, sed, maybe python??
Will i need regex?
Thanks for opinions...
This could be a place where regular expressions are the correct solution. You are only searching for text in attributes, and the contents are regular, fitting a pattern.
The following python regular expression would locate these links for you:
r'href="((?:\.\./)+external\.html\?link=)([^"]+)"'
The pattern we look for is something inside a href="" chunk of text, where that 'something' starts with one or more instances of ../, followed by external.html?link=, then followed with any text that does not contain a " quote.
The matched text after the equals sign is grouped in group 2 for easy retrieval, group 1 holds the ../../external.html?link= part.
If all you want to do is remove the ../../external.html?link= part altogether (so the links point directly to the endpoint instead of going via the redirect page), leave off the first group and do a simple .sub() on your HTML files:
import re
redirects = re.compile(r'href="(?:\.\./)+external\.html\?link=([^"]+)"')
# ...
redirects.sub(r'href="\1"', somehtmlstring)
Note that this could also match any body text (so outside HTML tags), this is not a HTML-aware solution. Chances are there is no such body text though. But if there is, you'll need a full-blown HTML parser like BeautifulSoup or lxml instead.
Use a HTML parser like BeautifulSoup or lxml.html.

How can I make a regular expression to extract all anchor tags or links from a string?

I've seen other questions which will parse either all plain links, or all anchor tags from a string, but nothing that does both.
Ideally, the regular expression will be able to parse a string like this (I'm using Python):
>>> import re
>>> content = '
http://www.google.com Some other text.
And even more text! http://stackoverflow.com
'
>>> links = re.findall('some-regular-expression', content)
>>> print links
[u'http://www.google.com', u'http://stackoverflow.com']
Is it possible to produce a regular expression which would not result in duplicate links being returned? Is there a better way to do this?
No matter what you do, it's going to be messy. Nevertheless, a 90% solution might resemble:
r'<a\s[^>]*>([^<]*)</a>|\b(\w+://[^<>\'"\t\r\n\xc2\xa0]*[^<>\'"\t\r\n\xc2\xa0 .,()])'
Since that pattern has two groups, it will return a list of 2-tuples; to join them, you could use a list comprehension or even a map:
map(''.join, re.findall(pattern, content))
If you want the src attribute of the anchor instead of the link text, the pattern gets even messier:
r'<a\s[^>]*src=[\'"]([^"\']*)[\'"][^>]*>[^<]*</a>|\b(\w+://[^<>\'"\t\r\n\xc2\xa0]*[^<>\'"\t\r\n\xc2\xa0 .,()])'
Alternatively, you can just let the second half of the pattern pick up the src attribute, which also alleviates the need for the string join:
r'\b\w+://[^<>\'"\t\r\n\xc2\xa0]*[^<>\'"\t\r\n\xc2\xa0 .,()]'
Once you have this much in place, you can replace any found links with something that doesn't look like a link, search for '://', and update the pattern to collect what it missed. You may also have to clean up false positives, particularly garbage at the end. (This pattern had to find links that included spaces, in plain text, so it's particularly prone to excess greediness.)
Warning: Do not rely on this for future user input, particularly when security is on the line. It is best used only for manually collecting links from existing data.
Usually you should never parse HTML with regular expressions since HTML isn't a regular language. Here it seems you only want to get all the http-links either they are in an A element or in text. How about getting them all and then remove the duplicates?
Try something like
set(re.findall("(http:\/\/.*?)[\"' <]", content))
and see if it serves your purpose.
Writing a regex pattern that matches all valid url is tricky business.
If all you're looking for is to detect simple http/https URLs within an arbitrary string, I could offer you this solution:
>>> import re
>>> content = 'http://www.google.com Some other text. And even more text! http://stackoverflow.com'
>>> re.findall(r"https?://[\w\-.~/?:#\[\]#!$&'()*+,;=]+", content)
['http://www.google.com', 'http://www.google.com', 'http://stackoverflow.com']
That looks for strings that start with http:// or https:// followed by one or more valid chars.
To avoid duplicate entries, use set():
>>> list(set(re.findall(r"https?://[\w\-.~/?:#\[\]#!$&'()*+,;=]+", content)))
['http://www.google.com', 'http://stackoverflow.com']
You should not use regular expressions to extract things from HTML. You should use an HTML parser.
If you also want to extract things from the text of the page then you should do that separately.
Here's how you would do it with lxml:
# -*- coding: utf8 -*-
import lxml.html as lh
import re
html = """
is.gd/testhttp://www.google.com Some other text.
And even more text! http://stackoverflow.com
here's a url bit.ly/test
"""
tree = lh.fromstring(html)
urls = set([])
for a in tree.xpath('//a'):
urls.add(a.text)
for text in tree.xpath('//text()'):
for url in re.findall(r'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))', text):
urls.add(url[0])
print urls
Result:
set(['http://www.google.com', 'bit.ly/test', 'http://stackoverflow.com', 'is.gd/test'])
URL matchine regex from here: http://daringfireball.net/2010/07/improved_regex_for_matching_urls
No, it will not be able to parse string like this. Regexp are capable of simple matching and you can't handle parsing a complicated grammar as html just with one or two regexps.

Retrieving string from html on non-unique table

Here is the html I am trying to parse.
<TD>Serial Number</TD><TD>AB12345678</TD>
I am attempting to use regex to parse the data. I heard about BeautifulSoup but there are around 50 items like this on the page all using the same table parameters and none of them have ID numbers. The closest they have to unique identifiers is the data in the cell before the data I need.
serialNumber = re.search("Serial Number</td><td>\n(.*?)</td>", source)
Source is simply the source code of the page grabbed using urllib. There is new line in the html between the second and the serial number but I am unsure if that matters.
Pyparsing can give you a little more robust extractor for your data:
from pyparsing import makeHTMLTags, Word, alphanums
htmlfrag = """<blah></blah><TD>Serial Number</TD><TD>
AB12345678
</TD><stuff></stuff>"""
td,tdEnd = makeHTMLTags("td")
sernoFormat = (td + "Serial Number" + tdEnd +
td + Word(alphanums)('serialNumber') + tdEnd)
for sernoData in sernoFormat.searchString(htmlfrag):
print sernoData.serialNumber
Prints:
AB12345678
Note that pyparsing doesn't care where the extra whitespace falls, and it also handles unexpected attributes that might crop up in the defined tags, whitespace inside tags, tags in upper/lower case, etc.
In most of the cases it is better to work on html using an appropriate parser, but for some cases it is perfectly OK to use regular expressions for the job. I do not know enough about your task to judge if it is a good solution or if it is better to go with #Paul 's solution, but here I try to fix your regex:
serialNumber = re.search("Serial Number</td><td>(.*?)</td>", source, re.S | re.I )
I removed the \n, because it is difficult in my opinion (\n,\r,\r\n, ...?), instead I used the option re.S (Dotall).
But be aware, now if there is a newline, it will be in your capturing group! i.e. you should strip whitespaces afterwards from your result.
Another problem of your regex is the <TD> in your string but you search for <td>. There for is the option re.I (IgnoreCase).
You can find more explanations about regex here on docs.python.org

Categories