Suppose I have the following string:
"<p>Hello</p>NOT<p>World</p>"
and I want to extract the words Hello and World.
I created the following script for the job:
#!/usr/bin/env python
import re
string = "<p>Hello</p>NOT<p>World</p>"
match = re.findall(r"(<p>[\w\W]+</p>)", string)
print match
I'm not particularly interested in stripping <p> and </p> so I never bothered doing it within the script.
The interpreter prints
['<p>Hello</p>NOT<p>World</p>']
so it obviously sees the first <p> and the last </p> while disregarding the tags in between. Shouldn't findall() return all three matching strings, though? (The string it prints, and the two words.)
And if it shouldn't, how can I alter the code to do so?
PS: This is for a project, and I found an alternative way to do what I needed, so this is for educational reasons, I guess.
You get the entire contents in a single match because [\w\W]+ matches as much text as it can (including all of your <p> and </p> tags). To prevent this, use the non-greedy version by appending a ?.
match = re.findall(r"(<p>[\w\W]+?</p>)", string)
# ['<p>Hello</p>', '<p>World</p>']
From the documentation:
*?, +?, ??
The '*', '+', and '?' qualifiers are all greedy; they match as much text as possible. Sometimes this behaviour isn’t desired; if the RE <.*> is matched against <a> b <c>, it will match the entire string, and not just <a>. Adding ? after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched. Using the RE <.*?> will match only <a>.
If you don't want the <p> and </p> tags in the result, you will want to use lookahead and lookbehind assertions to exclude them from the match.
match = re.findall(r"((?<=<p>)\w+?(?=</p>))", string)
# ['Hello', 'World']
As a side note, though: if you are trying to parse HTML or XML with regular expressions, it is preferable to use a library such as BeautifulSoup, which is intended for parsing HTML.
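For example, a minimal sketch using BeautifulSoup (assuming the beautifulsoup4 package is installed):
from bs4 import BeautifulSoup
string = "<p>Hello</p>NOT<p>World</p>"
soup = BeautifulSoup(string, "html.parser")
# Collect the text of every <p> element.
words = [p.get_text() for p in soup.find_all("p")]
print(words)
# ['Hello', 'World']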
Related
I want to replace some "markdown" tags into html tags.
for example:
#Title1#
##title2##
Welcome to **My Home Page**
will be turned into
<h1>Title1</h1>
<h2>title2</h2>
Welcome to <b>My Home Page</b>
I just don't know how to do that... For Title1, I tried this:
#!/usr/bin/env python3
import re
text = '''
#Title1#
##title2##
'''
p = re.compile('^#\w*#\n$')
print(p.sub('<h1>\w*</h1>',text))
but nothing happens; the text comes out unchanged:
#Title1#
##title2##
How can I turn this kind of bbcode/markdown into HTML tags?
You can substitute #...# with <h1>...</h1> using a regex, and I believe you can get this to work with double # and so on to cover other markdown features. Still, you should listen to @Thomas's and @nhahtdh's comments and use a markdown parser: using regexes in such cases is unreliable, slow, and unsafe.
As for inline text like **...** to <b>...</b>, you can try a similar regex substitution. Hopefully you can tweak this for other features like underlining and so on.
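For instance, a minimal sketch of the inline substitution (the sample text is taken from the question):
import re
text = "Welcome to **My Home Page**"
# Non-greedy .+? keeps each **...** pair matched separately.
print(re.sub(r'\*\*(.+?)\*\*', r'<b>\1</b>', text))
# Welcome to <b>My Home Page</b>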
Your regular expression does not work because, in the default mode, ^ and $ match the beginning and the end of the whole string, respectively.
'^'
(Caret.) Matches the start of the string, and in MULTILINE mode also matches immediately after each newline (my emph.)
'$'
Matches the end of the string or just before the newline at the end of the string, and in MULTILINE mode also matches before a newline. foo matches both ‘foo’ and ‘foobar’, while the regular expression foo$ matches only ‘foo’. More interestingly, searching for foo.$ in 'foo1\nfoo2\n' matches ‘foo2’ normally, but ‘foo1’ in MULTILINE mode; searching for a single $ in 'foo\n' will find two (empty) matches: one just before the newline, and one at the end of the string.
(7.2.1. Regular Expression Syntax)
Add the flag re.MULTILINE to your compile line, and let $ match the line ending instead of a literal \n (in MULTILINE mode $ matches just before each newline, so once \n has consumed it, $ has nothing left to match):
p = re.compile(r'^#(\w*)#$', re.MULTILINE)
and it should work, at least for single words such as your example. A better check would be
p = re.compile(r'^#([^#]*)#$', re.MULTILINE)
i.e. any sequence that does not contain a #.
In both expressions, note the parentheses added around the part you want to copy; they let you reuse that text in your replacement (as \1). See the official documentation on Grouping for that.
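Putting it together, a minimal sketch (the double-# handling is my own extension, not part of the original answer):
import re
text = '''
#Title1#
##title2##
'''
# Substitute the more specific ## pattern first, then the single-# pattern.
h2 = re.compile(r'^##([^#]+)##$', re.MULTILINE)
h1 = re.compile(r'^#([^#]+)#$', re.MULTILINE)
result = h1.sub(r'<h1>\1</h1>', h2.sub(r'<h2>\1</h2>', text))
print(result)
# <h1>Title1</h1>
# <h2>title2</h2>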
Question:
Is there any way to tell a regular expression engine to treat a certain part of a regular expression as verbatim (i.e. look for that part exactly as it is, without the usual parsing) without manually escaping special characters?
Some context:
I'm trying to backreference a group on a given regular expression from another regular expression. For instance, suppose I want to match hello(.*?)olleh against text 1 and then look for bye$1eyb in text 2, where $1 will be replaced by whatever matched group 1 in text 1. Therefore, if text 1 happens to contain the string "helloFOOolleh", the program will look for "byeFOOeyb" in text 2.
The above works fine in most cases, but if text 1 were to contain something like "hello.olleh", the program will match not only "bye.eyb" but also "byeXeyb", "byeueyb", etc. in text 2, as it is interpreting . as a regex special character and not the plain dot character.
Additional comments:
I can't just search for the plain string resulting from parsing $1 into whatever group 1 matches, as whatever I want to search for in text 2 could itself contain other unrelated regular expressions.
I have been trying to avoid parsing the match returned from text 1 and escaping every single special character by hand, but if anyone knows of a way to do that neatly, that could also work.
I'm currently working on this in Python, but if it can be done easily with any other language/program I'm happy to give it a try.
You can use the re.escape function to escape the text you want to match literally. So after you extract your match text (e.g., "." in "hello.olleh"), apply re.escape to it before inserting it into your second regex.
To illustrate what BrenBarn wrote,
import re
text1 = "hello.olleh"
text2_match = "bye.eyb"
text2_nomatch = "byeXeyb"
found = re.fullmatch(r"hello(.*?)olleh", text1).group(1)
You can then build a new search pattern with re.escape:
new_search = "bye{}eyb".format(re.escape(found))
Tests:
re.search(new_search, text2_match)
#>>> <_sre.SRE_Match object; span=(0, 7), match='bye.eyb'>
re.search(new_search, text2_nomatch)
#>>> None
I'm having problems with the findall method in Python. I'm accessing a web page's HTML and trying to return the name of a product, in this case 'bread', and print it out to the console.
Don't use regex for HTML parsing.
There are a few solutions. I suggest BeautifulSoup (http://www.crummy.com/software/BeautifulSoup/)
Having said that, in this particular case a regex will suffice. Just relax it a notch: there might be more or fewer spaces, or maybe those are tabs, so instead of literal spaces use the whitespace class \s:
product = re.findall(r'Item:\s*is\s*in\s*lane\s*12\s*(\w*)', content)
print product[0]
Since the '*', '+', and '?' qualifiers are all greedy (they match as much text as possible), you don't need to restrict the pattern with [^<]*<br>.
In case you still want to use regexps, here's a working one for your case:
product = re.findall(r'<br>\s*Item:\s+is\s+in\s+lane 12\s+(\w*)[^<]*<br>', content)
It takes into account DSM's suggestion about flexible whitespace, as well as any non-letters after (\w*) that might appear before the <br>.
I'm confused about python greedy/not-greedy characters.
"Given multi-line html, return the final tag on each line."
I would think this would be correct:
re.findall('<.*?>$', html, re.MULTILINE)
I'm irked because I expected a list of single tags like:
"</html>", "<ul>", "</td>".
My O'Reilly's Pocket Reference says that *? will "match 0 or more times, but as few times as possible."
So why am I getting 'greedier' matches, i.e., more than one tag in some (but not all) matches?
Your problem stems from the fact that you have an end-of-line anchor ('$'). The way non-greedy matching works is that the engine first searches for the first unconstrained pattern on the line ('<' in your case). It then looks for the first '>' character (which you have constrained, with the $ anchor, to be at the end of the line). So a non-greedy * is not any different from a greedy * in this situation.
Since you cannot remove the '$' from your RE (you are looking for the final tag on a line), you will need to take a different tack; see @Mark's answer: '<[^><]*>$' will work.
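A quick sketch of the difference (the sample HTML is my own):
import re
html = "<td>one</td><td>two</td>\n<ul>\n</html>"
# Lazy .*? still starts at the first '<' on the line, so the $ anchor
# forces it to stretch across every tag up to the line's end.
print(re.findall(r'<.*?>$', html, re.MULTILINE))
# ['<td>one</td><td>two</td>', '<ul>', '</html>']
# Excluding angle brackets pins the match to the final tag only.
print(re.findall(r'<[^><]*>$', html, re.MULTILINE))
# ['</td>', '<ul>', '</html>']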
I'm writing a very simple bbcode parser. If I want to replace hello i'm a [b]bold[/b] text, I have success with replacing this regex
r'\[b\](.*)\[\/b\]'
with this
<strong>\g<1></strong>
to get hello i'm a <strong>bold</strong> text.
If I have two or more tags of the same type, it fails, e.g.:
i'm [b]bold[/b] and i'm [b]bold[/b] too
gives
i'm <strong>bold[/b] and i'm [b]bold</strong> too
How can I solve this? Thanks.
You shouldn't use regular expressions to parse non-regular languages (like matching tags). Look into a parser instead.
Edit: a quick Google search turns up existing BBCode parsers for Python.
Just change your regular expression from:
r'\[b\](.*)\[\/b\]'
to
r'\[b\](.*?)\[\/b\]'
The * qualifier is greedy; by appending a ? you make it perform as a non-greedy qualifier.
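A quick check of the non-greedy version (a minimal sketch using the question's sample text):
import re
text = "i'm [b]bold[/b] and i'm [b]bold[/b] too"
# (.*?) stops at the first [/b], so each tag pair is handled separately.
print(re.sub(r'\[b\](.*?)\[\/b\]', r'<strong>\1</strong>', text))
# i'm <strong>bold</strong> and i'm <strong>bold</strong> too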
Here's a more complete explanation taken from the Python re documentation:
The '*', '+', and '?' qualifiers are all greedy; they match as much text as possible. Sometimes this behaviour isn't desired; if the RE <.*> is matched against '<H1>title</H1>', it will match the entire string, and not just '<H1>'. Adding '?' after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched. Using .*? in the previous expression will match only '<H1>'.
Source: http://docs.python.org/library/re.html