How to match two patterns with one regex in Python

I am using Python's re module to do some regex matching.
pattern1 = re.compile('<a>(.*?)</a>[\s\S]*?<b>(.*?)</b>')
pattern2 = re.compile('<b>(.*?)</b>[\s\S]*?<a>(.*?)</a>')
items = re.findall(pattern1, line)
if items:
    print items[0]
else:
    items = re.findall(pattern2, line)
    if items:
        print items[0]
As you can see, the order of tags a and b is not fixed (a can come before or after b).
I used two patterns (try pattern 1 first, then try pattern 2) to find the text between tag a and tag b, but that looks ugly, and I do not know how to get the same result with a single pattern.
Thanks!

Please don't use regular expressions to parse HTML. Regular expressions can't deal with HTML(*). There is more than one nice HTML parser for Python; use one of them.
The following example uses pyquery, a jQuery API implementation on top of lxml.
from pyquery import PyQuery as pq
html_doc = """
<body>
<a>A first</a><b>B second</b>
<p>Other stuff here</p>
<b>B first</b><a>A second</a>
</body>
"""
doc = pq(html_doc)
for item in doc("a + b, b + a").prev():
    print item.text
Output:
A first
B first
Explanation: The selector a + b selects all <b> directly preceded by an <a>. .prev() moves to the immediately previous element, i.e. the <a> (which you seem to be interested in - but only when a <b> follows it). b + a does the same thing for the reverse element order.
(*) For one, regular expressions cannot handle indefinitely nested constructs, they have problems when match order is not predictable, and they have no way of handling the semantic implications of HTML (character escape sequences, optionally and implicitly closed elements, lenient parsing of input that is not very strictly valid and more). They tend to break silently when the input is in a form that you did not anticipate. And, when thrown at HTML, they tend to get so complex that they make anybody's head hurt. Don't invest your time in writing ever more sophisticated regular expressions to parse HTML, it's a losing battle. The best state you can end up in is something that kind of works but is still inferior to a parser. Invest your time in learning a parser.

Change it to:
re.compile('(?:<b>|<a>)(.*?)(?:</a>|</b>)[\s\S]*?(?:<a>|<b>)(.*?)(?:</a>|</b>)')
Note that this needs more attention, as it also matches <a> followed by </b>. If you want to prevent that, capture the tag name itself in the first group and then force the matching closing tag with a backreference, something like:
</\1>
This will match </ followed by the previously captured tag name, which will be a or b.
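A minimal sketch of that backreference idea (the sample line, and restricting the tag names to a and b, are assumptions for illustration):
import re

# Capture the tag name ([ab]) so each closing tag is forced to match its
# opening tag via the backreferences \1 and \3.
pattern = re.compile(r'<([ab])>(.*?)</\1>[\s\S]*?<([ab])>(.*?)</\3>')

line = '<b>B first</b> some other text <a>A second</a>'
m = pattern.search(line)
if m:
    print("%s %s" % (m.group(2), m.group(4)))   # -> B first A second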
I don't recommend using regex to parse HTML, use a parser instead.

Please use an HTML parser instead (as Tomalak and Maroun Maroun already suggested). Tomalak has already explained why.
I'll just provide a literal solution to your problem for fun:
To combine two patterns, just use |, like:
pattern = re.compile('<a>(.*?)</a>[\s\S]*?<b>(.*?)</b>|<b>(.*?)</b>[\s\S]*?<a>(.*?)</a>')
But now you capture 4 groups, so you have to manually check which groups you matched.
match = re.search(pattern, line)
if match.group(1, 2) != (None, None):
    print match.group(1, 2)
else:
    print match.group(3, 4)
Or, simpler, using a named group:
pattern = re.compile('<a>(?P<first>.*?)</a>[\s\S]*?<b>(.*?)</b>|<b>(.*?)</b>[\s\S]*?<a>(.*?)</a>')
match = re.search(pattern, line)
print match.group(1, 2) if match.group('first') is not None else match.group(3, 4)
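A quick check of the combined pattern on both tag orders (the sample lines are made up for illustration):
for line in ['<a>A first</a> filler <b>B second</b>',
             '<b>B first</b> filler <a>A second</a>']:
    m = re.search(pattern, line)
    pair = m.group(1, 2) if m.group('first') is not None else m.group(3, 4)
    print("%s / %s" % pair)   # -> A first / B second, then B first / A second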

Related

How to perform a tag-agnostic text string search in an html file?

I'm using LanguageTool (LT) with the --xmlfilter option enabled to spell-check HTML files. This forces LanguageTool to strip all tags before running the spell check.
This also means that all reported character positions are off because LT doesn't "see" the tags.
For example, if I check the following HTML fragment:
<p>This is kin<b>d</b> o<i>f</i> a <b>stupid</b> question.</p>
LanguageTool will treat it as a plain text sentence:
This is kind of a stupid question.
and returns the following message:
<error category="Grammar" categoryid="GRAMMAR" context=" This is kind of a stupid question. " contextoffset="24" errorlength="9" fromx="8" fromy="8" locqualityissuetype="grammar" msg="Don't include 'a' after a classification term. Use simply 'kind of'." offset="24" replacements="kind of" ruleId="KIND_OF_A" shortmsg="Grammatical problem" subId="1" tox="17" toy="8"/>
(In this particular example, LT has flagged "kind of a.")
Since the search string might be wrapped in tags and might occur multiple times I can't do a simple index search.
What would be the most efficient Python solution to reliably locate any given text string in an HTML file? (LT returns an approximate character position, which might be off by 10-30% depending on the number of tags, as well as the words before and after the flagged word(s).)
I.e. I'd need to do a search that ignores all tags, but includes them in the character position count.
In this particular example, I'd have to locate "kind of a" and find the location of the letter k in:
kin<b>d</b> o<i>f</i> a
This may not be the speediest way to go, but pyparsing will recognize HTML tags in most forms. The following code inverts the typical scan, creating a scanner that will match any single character, and then configuring the scanner to skip over HTML open and close tags, and also common HTML '&xxx;' entities. pyparsing's scanString method returns a generator that yields the matched tokens, the starting, and the ending location of each match, so it is easy to build a list that maps every character outside of a tag to its original location. From there, the rest is pretty much just ''.join and indexing into the list. See the comments in the code below:
test = "<p>This is kin<b>d</b> o<i>f</i> a <b>stupid</b> question.</p>"
from pyparsing import Word, printables, anyOpenTag, anyCloseTag, commonHTMLEntity
non_tag_text = Word(printables+' ', exact=1).leaveWhitespace()
non_tag_text.ignore(anyOpenTag | anyCloseTag | commonHTMLEntity)
# use scanString to get all characters outside of tags, and build list
# of (char,loc) tuples
char_locs = [(t[0], loc) for t,loc,endloc in non_tag_text.scanString(test)]
# imagine a world without HTML tags...
untagged = ''.join(ch for ch, loc in char_locs)
# look for our string in the untagged text, then index into the char,loc list
# to find the original location
search_str = 'kind of a'
orig_loc = char_locs[untagged.find(search_str)][1]
# print the test string, and mark where we found the matching text
print(test)
print(' '*orig_loc + '^')
"""
Should look like this:
<p>This is kin<b>d</b> o<i>f</i> a <b>stupid</b> question.</p>
           ^
"""
The --xmlfilter option is deprecated because of issues like this. The proper solution is to remove the tags yourself but keep the positions so you have a mapping to correct the results that come back from LT. When using LT from Java, this is supported by AnnotatedText, but the algorithm should be simple enough to port it. (full disclosure: I'm the maintainer of LT)
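A rough Python sketch of that mapping idea (hedged: strip_tags_with_map and its tag-only regex are just for illustration; a real port of AnnotatedText would also have to handle entities, comments, and so on):
import re

def strip_tags_with_map(html):
    """Return (plain_text, offset_map) where offset_map[i] is the index in
    the original HTML of plain_text[i]. Only tags are removed here."""
    plain, offset_map = [], []
    pos = 0
    for m in re.finditer(r'<[^>]+>', html):
        for i in range(pos, m.start()):
            plain.append(html[i])
            offset_map.append(i)
        pos = m.end()
    for i in range(pos, len(html)):
        plain.append(html[i])
        offset_map.append(i)
    return ''.join(plain), offset_map

html = '<p>This is kin<b>d</b> o<i>f</i> a <b>stupid</b> question.</p>'
text, mapping = strip_tags_with_map(html)
offset = text.find('kind of a')   # the offset LanguageTool would report
print(mapping[offset])            # -> 11, the position of the 'k' in the HTML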

Find by Text and Replace in HTML BeautifulSoup

I'm trying to mark up an HTML file (literally wrapping strings in "mark" tags) using python and BeautifulSoup. The problem is basically as follows...
Say I have my original html document:
test = "<h1>oh hey</h1><div>here is some <b>SILLY</b> text</div>"
I want to do a case-insensitive search for a string in this document (ignoring HTML) and wrap it in "mark" tags. So let's say I want to find "here is some silly text" in the html (ignoring the bold tags). I'd like to take the matching html and wrap it in "mark" tags.
For example, if I want to search for "here is some silly text" in test, the desired output is:
"<h1>oh hey</h1><div><mark>here is some <b>SILLY</b> text</mark></div>"
Any ideas? If it's more appropriate to use lxml or regular expressions, I'm open to those solutions as well.
>>> import bs4
>>> soup = bs4.BeautifulSoup(test)
>>> matches = soup.find_all(lambda x: x.text.lower() == 'here is some silly text')
>>> for match in matches:
...     match.wrap(soup.new_tag('mark'))
>>> soup
<html><body><h1>oh hey</h1><mark><div>here is some <b>SILLY</b> text</div></mark></body></html>
The reason I had to pass a function as the name argument to find_all (comparing x.text.lower()), instead of just using the text argument with a function comparing x.lower(), is that the text argument only looks at each individual string, so it will not find content that is split across child tags, which is apparently what you want.
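To illustrate (reusing the soup built above), the text= variant only sees each individual string, so a query that spans the <b> tag is never found:
>>> soup.find_all(text=lambda s: s.lower() == 'here is some silly text')
[]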
The wrap function may not work this way in some cases. If it doesn't, you will have to instead enumerate(matches), and set matches[i] = match.wrap(soup.new_tag('mark')). (You can't use replace_with to replace a tag with a new tag that references itself.)
Also note that if your intended use case allows any non-ASCII string to ever match 'here is some silly text' (or if you want to broaden the code to handle non-ASCII search strings), the code above using lower() may be incorrect. You may want to call str.casefold() and/or locale.strxfrm(s) and/or use locale.strcoll(s, t) instead of using ==, but you'll have to understand what you want and how to get it to pick the right answer.
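For example, a minimal sketch of the casefold() variant (whether it is sufficient still depends on your locale requirements):
>>> needle = 'here is some silly text'.casefold()
>>> matches = soup.find_all(lambda x: x.text.casefold() == needle)
>>> [m.name for m in matches]
['div']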

Python: store many regex matches in tuple?

I'm trying to make a simple Python-based HTML parser using regular expressions. My problem is trying to get my regex search query to find all the possible matches, then store them in a tuple.
Let's say I have a page with the following stored in the variable HTMLtext:
<ul>
<li class="active"><a href="/blog/home"><b>Back to the index</b></a></li>
<li><a href="/blog/about"><b>About Me!</b></a></li>
<li><a href="/blog/music"><b>Audio Production</b></a></li>
<li><a href="/blog/photos"><b>Gallery</b></a></li>
<li><a href="/blog/stuff"><b>Misc</b></a></li>
<li><a href="/blog/contact"><b>Shoot me an email</b></a></li>
</ul>
I want to perform a regex search on this text and return a tuple containing the last URL directory of each link. So, I'd like to return something like this:
pages = ["home", "about", "music", "photos", "stuff", "contact"]
So far, I'm able to use regex to search for one result:
pages = [re.compile('<a href="/blog/(.*)">').search(HTMLtext).group(1)]
Running this expression makes pages = ['home'].
How can I get the regex search to continue for the whole text, appending the matched text to this tuple?
(Note: I know I probably should NOT be using regex to parse HTML. But I want to know how to do this anyway.)
Your pattern won’t work on all inputs, including yours. The .* is going to be too greedy (technically, it finds a maximal match), so it will span from the first href to the last corresponding close. The two simplest ways to fix this are to use either a minimal match or a negated character class.
# minimal match approach
pages = re.findall(r'<a\s+href="/blog/(.+?)">',
                   full_html_text, re.I | re.S)
# negated charclass approach
pages = re.findall(r'<a\s+href="/blog/([^"]+)">',
                   full_html_text, re.I)
Obligatory Warning
For simple and reasonably well-constrained text, regexes are just fine; after all, that’s why we use regex search-and-replace in our text editors when editing HTML! However, it gets more and more complicated the less you know about the input, such as
if there’s some other field intervening between the <a and the href, like <a title="foo" href="bar">
casing issues like <A HREF='foo'>
whitespace issues
alternate quotes like href='/foo/bar' instead of href="/foo/bar"
embedded HTML comments
That’s not an exhaustive list of concerns; there are others. And so, using regexes on HTML is possible, but whether it’s expedient depends on too many other factors to judge.
However, from the little example you’ve shown, it looks perfectly ok for your own case. You just have to spiff up your pattern and call the right method.
Use the findall function of the re module:
pages = re.findall('<a href="/blog/([^"]*)">',HTMLtext)
print(pages)
Output:
['home', 'about', 'music', 'photos', 'stuff', 'contact']
Use findall instead of search:
>>> pages = re.compile('<a href="/blog/(.*)">').findall(HTMLtext)
>>> pages
['home', 'about', 'music', 'photos', 'stuff', 'contact']
The re.findall() function and the re.finditer() function are used to find multiple matches.
To find all results, use findall(). Also, you only need to compile the regex once; then you can reuse it.
href_re = re.compile('<a href="/blog/(.*)">') # Compile the regexp once
pages = href_re.findall(HTMLtext)  # Find all matches: ["home", "about", ...]
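The finditer() variant yields match objects instead, which is handy if you also want each match's position (a small hedged sketch):
for m in href_re.finditer(HTMLtext):
    print("%s at offset %d" % (m.group(1), m.start()))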

How can I make a regular expression to extract all anchor tags or links from a string?

I've seen other questions which will parse either all plain links, or all anchor tags from a string, but nothing that does both.
Ideally, the regular expression will be able to parse a string like this (I'm using Python):
>>> import re
>>> content = '<a href="http://www.google.com">http://www.google.com</a> Some other text. And even more text! http://stackoverflow.com'
>>> links = re.findall('some-regular-expression', content)
>>> print links
[u'http://www.google.com', u'http://stackoverflow.com']
Is it possible to produce a regular expression which would not result in duplicate links being returned? Is there a better way to do this?
No matter what you do, it's going to be messy. Nevertheless, a 90% solution might resemble:
r'<a\s[^>]*>([^<]*)</a>|\b(\w+://[^<>\'"\t\r\n\xc2\xa0]*[^<>\'"\t\r\n\xc2\xa0 .,()])'
Since that pattern has two groups, it will return a list of 2-tuples; to join them, you could use a list comprehension or even a map:
map(''.join, re.findall(pattern, content))
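For instance, a self-contained run of the two-group pattern (the sample content string is made up for illustration):
import re

pattern = r'<a\s[^>]*>([^<]*)</a>|\b(\w+://[^<>\'"\t\r\n\xc2\xa0]*[^<>\'"\t\r\n\xc2\xa0 .,()])'
content = 'See <a href="/x">this page</a> or http://example.com'
print([''.join(t) for t in re.findall(pattern, content)])
# -> ['this page', 'http://example.com']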
If you want the href attribute of the anchor instead of the link text, the pattern gets even messier:
r'<a\s[^>]*href=[\'"]([^"\']*)[\'"][^>]*>[^<]*</a>|\b(\w+://[^<>\'"\t\r\n\xc2\xa0]*[^<>\'"\t\r\n\xc2\xa0 .,()])'
Alternatively, you can just let the second half of the pattern pick up the href attribute's value, which also alleviates the need for the string join:
r'\b\w+://[^<>\'"\t\r\n\xc2\xa0]*[^<>\'"\t\r\n\xc2\xa0 .,()]'
Once you have this much in place, you can replace any found links with something that doesn't look like a link, search for '://', and update the pattern to collect what it missed. You may also have to clean up false positives, particularly garbage at the end. (This pattern had to find links that included spaces, in plain text, so it's particularly prone to excess greediness.)
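A rough sketch of that replace-and-recheck loop (hedged: the '[link]' placeholder and the sample content, including a deliberately odd non-breaking-space URL that the pattern misses, are made up):
import re

url_pat = r'\b\w+://[^<>\'"\t\r\n\xc2\xa0]*[^<>\'"\t\r\n\xc2\xa0 .,()]'
content = 'Missed one: ftp\xc2\xa0://odd.example.org\nCovered: http://example.com'
# Blank out everything the pattern already catches...
leftover = re.sub(url_pat, '[link]', content)
# ...then anything still containing '://' is something the pattern missed.
for m in re.finditer(r'://', leftover):
    print(leftover[max(0, m.start() - 30):m.end() + 30])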
Warning: Do not rely on this for future user input, particularly when security is on the line. It is best used only for manually collecting links from existing data.
Usually you should never parse HTML with regular expressions, since HTML isn't a regular language. Here it seems you only want to get all the http links, whether they are in an <a> element or in plain text. How about getting them all and then removing the duplicates?
Try something like
set(re.findall("(http:\/\/.*?)[\"' <]", content))
and see if it serves your purpose.
Writing a regex pattern that matches every valid URL is tricky business.
If all you're looking for is to detect simple http/https URLs within an arbitrary string, I could offer you this solution:
>>> import re
>>> content = '<a href="http://www.google.com">http://www.google.com</a> Some other text. And even more text! http://stackoverflow.com'
>>> re.findall(r"https?://[\w\-.~/?:#\[\]#!$&'()*+,;=]+", content)
['http://www.google.com', 'http://www.google.com', 'http://stackoverflow.com']
That looks for strings that start with http:// or https:// followed by one or more valid chars.
To avoid duplicate entries, use set():
>>> list(set(re.findall(r"https?://[\w\-.~/?:#\[\]#!$&'()*+,;=]+", content)))
['http://www.google.com', 'http://stackoverflow.com']
You should not use regular expressions to extract things from HTML. You should use an HTML parser.
If you also want to extract things from the text of the page then you should do that separately.
Here's how you would do it with lxml:
# -*- coding: utf8 -*-
import lxml.html as lh
import re
html = """
<a href="http://www.google.com">is.gd/test</a>http://www.google.com Some other text.
And even more text! http://stackoverflow.com
here's a url bit.ly/test
"""
tree = lh.fromstring(html)
urls = set([])
for a in tree.xpath('//a'):
    urls.add(a.text)
for text in tree.xpath('//text()'):
    for url in re.findall(r'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))', text):
        urls.add(url[0])
print urls
Result:
set(['http://www.google.com', 'bit.ly/test', 'http://stackoverflow.com', 'is.gd/test'])
URL matching regex from here: http://daringfireball.net/2010/07/improved_regex_for_matching_urls
No, a regex will not be able to parse a string like this. Regexes are capable of simple matching, and you can't parse a grammar as complicated as HTML with just one or two regexps.

Retrieving string from html on non-unique table

Here is the html I am trying to parse.
<TD>Serial Number</TD><TD>AB12345678</TD>
I am attempting to use regex to parse the data. I have heard about BeautifulSoup, but there are around 50 items like this on the page, all using the same table parameters, and none of them have IDs. The closest thing to a unique identifier is the data in the cell before the data I need.
serialNumber = re.search("Serial Number</td><td>\n(.*?)</td>", source)
Source is simply the source code of the page, grabbed using urllib. There is a newline in the HTML between the second <TD> and the serial number, but I am unsure whether that matters.
Pyparsing can give you a little more robust extractor for your data:
from pyparsing import makeHTMLTags, Word, alphanums
htmlfrag = """<blah></blah><TD>Serial Number</TD><TD>
AB12345678
</TD><stuff></stuff>"""
td,tdEnd = makeHTMLTags("td")
sernoFormat = (td + "Serial Number" + tdEnd +
               td + Word(alphanums)('serialNumber') + tdEnd)
for sernoData in sernoFormat.searchString(htmlfrag):
    print sernoData.serialNumber
Prints:
AB12345678
Note that pyparsing doesn't care where the extra whitespace falls, and it also handles unexpected attributes that might crop up in the defined tags, whitespace inside tags, tags in upper/lower case, etc.
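For instance, a hedged check that reuses the definitions above (the attribute, spacing, and casing in this fragment are made up):
messy = '<TD ALIGN="left">Serial Number</TD><td >\n  XY98765432\n</td>'
for sernoData in sernoFormat.searchString(messy):
    print(sernoData.serialNumber)
# -> XY98765432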
In most cases it is better to work on HTML using an appropriate parser, but in some cases it is perfectly OK to use regular expressions for the job. I do not know enough about your task to judge whether that is a good solution here or whether it is better to go with Paul's solution, but here I try to fix your regex:
serialNumber = re.search("Serial Number</td><td>(.*?)</td>", source, re.S | re.I )
I removed the \n because it is unreliable in my opinion (is it \n, \r, \r\n, ...?); instead I used the re.S (DOTALL) option.
But be aware that if there is a newline, it will now be in your capturing group, i.e. you should strip whitespace from your result afterwards.
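For example, a small sketch of that stripping step (assuming source holds the page HTML, as in the question):
m = re.search("Serial Number</td><td>(.*?)</td>", source, re.S | re.I)
if m:
    serial = m.group(1).strip()   # drop the newline and surrounding whitespace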
Another problem with your regex is that the page contains <TD> but you search for <td>. That is what the re.I (IGNORECASE) option is for.
You can find more explanations about regex here on docs.python.org
