Python regex for matching bb code - python

I'm writing a very simple bbcode parser. If I want to replace the tags in hello i'm a [b]bold[/b] text, I have success replacing this regex
r'\[b\](.*)\[\/b\]'
with this
<strong>\g<1></strong>
to get hello, i'm a <strong>bold</strong> text.
If I have two or more tags of the same type, it fails, e.g.:
i'm [b]bold[/b] and i'm [b]bold[/b] too
gives
i'm <strong>bold[/b] and i'm [b]bold</strong> too
How can I solve this problem? Thanks

You shouldn't use regular expressions to parse non-regular languages (like matching tags). Look into a parser instead.
Edit - a quick Google search takes me here.

Just change your regular expression from:
r'\[b\](.*)\[\/b\]'
to
r'\[b\](.*?)\[\/b\]'
The * quantifier is greedy; appending a ? to it makes it non-greedy, so it matches as little as possible.
Here's a more complete explanation taken from the Python re documentation:
The '*', '+', and '?' qualifiers are all greedy; they match as much text as possible. Sometimes this behaviour isn’t desired; if the RE <.*> is matched against '<H1>title</H1>', it will match the entire string, and not just '<H1>'. Adding '?' after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched. Using .*? in the previous expression will match only '<H1>'.
Source: http://docs.python.org/library/re.html
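To see the difference side by side, here is a minimal sketch that runs both patterns from this answer against the failing input from the question. The variable names are only for illustration.

```python
import re

text = "i'm [b]bold[/b] and i'm [b]bold[/b] too"

# Greedy: .* runs on to the LAST [/b], so both tags collapse into one match.
greedy = re.sub(r'\[b\](.*)\[/b\]', r'<strong>\g<1></strong>', text)

# Non-greedy: .*? stops at the FIRST [/b] that follows each [b].
lazy = re.sub(r'\[b\](.*?)\[/b\]', r'<strong>\g<1></strong>', text)

print(greedy)  # i'm <strong>bold[/b] and i'm [b]bold</strong> too
print(lazy)    # i'm <strong>bold</strong> and i'm <strong>bold</strong> too
```

Note that this still does not handle nested tags of the same type, which is where the parser suggestion in the other answer comes in.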

Related

Regular expression error: unbalanced parenthesis at position n

I have been meaning to extract the month name from the following string with regex and despite the fact that my regex works on a platform like regex101, I can't seem to be able to extract the word "August".
import re
s = "word\anyword\2021\August\202108_filename.csv"
re.findall("\d+\\([[:alpha:]]+)\\\d+", s)
Which results in the following error:
error: unbalanced parenthesis at position 17
I also tried using re.compile, re.escape as per suggestions of the previous posts dealing with the same error but none of them seems to work.
Any help and also a little explanation on why this isn't working is greatly appreciated.
You can use
import re
s = r"word\anyword\2021\August\202108_filename.csv"
m = re.search(r"\d+\\([a-zA-Z]+)\\\d+", s)
if m:
    print(m.group(1))
See the Python demo.
There are three main problems here:
The input string should be the same as the one used at regex101.com, i.e. you need to make sure you are using literal backslashes in the Python code, hence the use of raw string literals for both the input text and the regex.
The POSIX character classes are not supported by Python re, so [[:alpha:]]+ should be replaced with some equivalent pattern, say, [A-Za-z]+ or [^\W\d_]+.
Since it seems like you only expect a single match (there is only one August (month) name in the string), you do not need re.findall, you can use re.search. Only use re.findall when you need to extract multiple matches from a string.
Also, see these posts:
Python regex - r prefix
What does the "r" in pythons re.compile(r' pattern flags') mean?
What exactly do "u" and "r" string flags do, and what are raw string literals?
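As a rough illustration of the first two points, the sketch below shows how a non-raw literal silently changes the text, and uses the [^\W\d_]+ alternative mentioned above in place of [[:alpha:]]+. The names bell, raw, path and month are made up for this example.

```python
import re

# Without the r prefix, "\a" is a single BEL character, not backslash + "a".
bell = "\a"
raw = r"\a"
print(len(bell), len(raw))  # 1 2

# Raw string: every backslash in the path is kept literally.
path = r"word\anyword\2021\August\202108_filename.csv"

# [^\W\d_] means "a word character that is neither a digit nor an
# underscore", i.e. a letter; it stands in for the unsupported [[:alpha:]].
pattern = re.compile(r"\d+\\([^\W\d_]+)\\\d+")

m = pattern.search(path)
month = m.group(1) if m else None
print(month)  # August
```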

findall() behaviour (python 2.7)

Suppose I have the following string:
"<p>Hello</p>NOT<p>World</p>"
and I want to extract the words Hello and World.
I created the following script for the job
#!/usr/bin/env python
import re
string = "<p>Hello</p>NOT<p>World</p>"
match = re.findall(r"(<p>[\w\W]+</p>)", string)
print match
I'm not particularly interested in stripping <p> and </p> so I never bothered doing it within the script.
The interpreter prints
['<p>Hello</p>NOT<p>World</p>']
so it obviously sees the first <p> and the last </p> while disregarding the tags in between. Shouldn't findall() return all three sets of matching strings though? (the string it prints, and the two words).
And if it shouldn't, how can I alter the code to do so?
PS: This is for a project and I found an alternative way to do what i needed to, so this is for educational reasons I guess.
You get the entire contents in a single match because [\w\W]+ will match as many things as it can (including all of your <p> and </p> tags). To prevent this, you want to use the non-greedy version by appending a ?.
match = re.findall(r"(<p>[\w\W]+?</p>)", string)
# ['<p>Hello</p>', '<p>World</p>']
From the documentation:
*?, +?, ??
The '*', '+', and '?' qualifiers are all greedy; they match as much text as possible. Sometimes this behaviour isn’t desired; if the RE <.*> is matched against <a> b <c>, it will match the entire string, and not just <a>. Adding ? after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched. Using the RE <.*?> will match only <a>.
If you don't want the <p> and </p> tags in the result, you will want to use look-ahead and look behind assertions to not include them in the result.
match = re.findall(r"((?<=<p>)\w+?(?=</p>))", string)
# ['Hello', 'World']
As a side note though, if you are trying to parse HTML or XML with regular expressions, it is preferable to use a library such as BeautifulSoup which is intended for parsing HTML.
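If the lookarounds feel heavy, a capture group is another option (this is an alternative, not what the answer above uses): when the pattern has exactly one group, findall returns the group contents rather than the whole match. A small sketch, written so it runs on Python 3 and also on 2.7:

```python
import re

html = "<p>Hello</p>NOT<p>World</p>"

# One capture group: findall returns what the group matched,
# so the surrounding <p> and </p> are not part of the result.
words = re.findall(r"<p>(.*?)</p>", html)

print(words)  # ['Hello', 'World']
```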

search a repeated structure with regex

I have a string of the structure:
A_1: text
a lot more text
A_2: some text
a lot more other text
Now I want to extract the descriptive title (A_1) and the following text. Something like
[("A_1", "text\na lot more text"),("A_2", "some text\na lot more other text")]
My expression I use is
(A_\d+):([.\s]+)
But I get only [('A_1', ' '), ('A_2', ' ')].
Does anyone have an idea for me?
Thanks in advance,
Martin
You can use a lookahead to limit the match to the next occurrence of the searched start indicator.
(?s)A_\d+:.*?(?=\s*A_\d+:|$)
(?s) dotall flag to make dot also match newlines
A_\d+: your start indicator
.*? match as few as possible (lazy dot)
(?=\s*A_\d+:|$) until start pattern with optional spaces ahead or $ end
See demo at regex101.com (Python code generator)
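To get the list of tuples asked for in the question, the lookahead pattern can be used from Python roughly like this. The two capture groups and the \s* after the colon are my additions (assumptions, not part of the pattern above), added so that findall returns (title, text) pairs without the leading space.

```python
import re

text = "A_1: text\na lot more text\n\nA_2: some text\na lot more other text"

# (A_\d+)   the title, captured
# \s*       skip whitespace after the colon
# (.*?)     the body, as short as possible (DOTALL lets . cross newlines)
# (?=...)   stop right before the next "A_<n>:" header or at the end
pattern = re.compile(r"(A_\d+):\s*(.*?)(?=\s*A_\d+:|$)", re.DOTALL)

pairs = pattern.findall(text)
print(pairs)
# [('A_1', 'text\na lot more text'), ('A_2', 'some text\na lot more other text')]
```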
Your [.\s]+ matches one or more literal dots (since . inside a character class loses its special meaning) and whitespaces. I think you meant to use . with a re.DOTALL flag. However, you can use something different, a tempered greedy token (there are other ways, too).
You can use
(?s)(A_\d+):\s*((?:(?!A_\d).)+)
See regex demo
IDEONE demo:
import re
p = re.compile(r'(A_\d+):\s*((?:(?!A_\d).)+)', re.DOTALL)
test_str = "A_1: text\na lot more text\n\nA_2: some text\na lot more other text"
print(p.findall(test_str))
The (?:(?!A_\d).)+ tempered greedy token will match any text up to the first A_+digit pattern.

how to replace markdown tags into html by python?

I want to replace some "markdown" tags into html tags.
for example:
#Title1#
##title2##
Welcome to **My Home Page**
will be turned into
<h1>Title1</h1>
<h2>title2</h2>
Welcome to <b>My Home Page</b>
I just don't know how to do that. For Title1, I tried this:
#!/usr/bin/env python3
import re
text = '''
#Title1#
##title2##
'''
p = re.compile('^#\w*#\n$')
print(p.sub('<h1>\w*</h1>',text))
but nothing happens; the output is unchanged:
#Title1#
##title2##
How can bbcode/markdown tags like these be converted into HTML tags?
Check this regex: demo
Here you can see how I substituted the #...# into <h1>...</h1>.
I believe you can get this to work with double # and so on to cover other markdown features, but you should still listen to the comments from @Thomas and @nhahtdh and use a markdown parser. Using regexes in such cases is unreliable, slow and unsafe.
As for inline text like **...** to <b>...</b>, you can try this regex with substitution: demo. Hope you can tweak this for other features like underlining and so on.
Your regular expression does not work because in the default mode, ^ and $ match only the beginning and the end of the whole string, respectively.
'^'
(Caret.) Matches the start of the string, and in MULTILINE mode also matches immediately after each newline (my emph.)
'$'
Matches the end of the string or just before the newline at the end of the string, and in MULTILINE mode also matches before a newline. foo matches both ‘foo’ and ‘foobar’, while the regular expression foo$ matches only ‘foo’. More interestingly, searching for foo.$ in 'foo1\nfoo2\n' matches ‘foo2’ normally, but ‘foo1’ in MULTILINE mode; searching for a single $ in 'foo\n' will find two (empty) matches: one just before the newline, and one at the end of the string.
(7.2.1. Regular Expression Syntax)
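Add the flag re.MULTILINE in your compile line, and drop the \n before $, because in MULTILINE mode $ already matches just before each newline (with the \n kept, $ would have to match at the start of the next line, which only succeeds on the last line):
p = re.compile(r'^#(\w*)#$', re.MULTILINE)
and it should work, at least for single words, such as your example. A better check would be
p = re.compile(r'^#([^#]*)#$', re.MULTILINE)
which accepts any sequence that does not contain a #.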
In both expressions, you need to add parentheses around the part you want to copy so you can use that text in your replacement code. See the official documentation on Grouping for that.
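Putting the pieces together, here is a minimal sketch of the substitution for the three cases in the question. The exact patterns are my own assumptions (in particular [^#\n]+, which also keeps a title from running over a line break), and this is nowhere near a full markdown converter.

```python
import re

text = "#Title1#\n##title2##\nWelcome to **My Home Page**\n"

# MULTILINE makes ^ and $ match at every line, not just at the string ends.
# [^#\n]+ cannot start on a '#', so the h1 rule never fires on a ## line.
html = re.sub(r'^##([^#\n]+)##$', r'<h2>\1</h2>', text, flags=re.MULTILINE)
html = re.sub(r'^#([^#\n]+)#$', r'<h1>\1</h1>', html, flags=re.MULTILINE)

# Inline bold: non-greedy so that two bold spans on one line stay separate.
html = re.sub(r'\*\*(.+?)\*\*', r'<b>\1</b>', html)

print(html)
```

The \1 in each replacement refers to the text captured by the parentheses, which is what the question's '<h1>\w*</h1>' replacement was missing.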

Regular Expression (regex) Pattern Matching in Python

My regular expression in python is as follows:
\\newcommand\\shortpage[.*?][.*?]{.*?{.*?}}
The text I am trying to match is:
\newcommand\shortpage[1][1]{\enlargethispage*{-#1\baselineskip}} % see Latex Companion, 2nd ed., p. 234
How do I fix my regular expression so that it properly matches my text?
Thank you.
Brackets and braces are metacharacters, you need to escape them:
\\newcommand\\shortpage\[.*?\]\[.*?\]\{.*?\{.*?\}\}
Actually, many regex engines don't require you to escape braces if it can be inferred from context that they aren't used as quantifiers (as in x{2,4}), but it's better to be explicit.
Furthermore, .* and .*? should be replaced, if possible, with something more specific than "match anything":
\\newcommand\\shortpage\[[^\]]*\]\[[^\]]*\]\{[^}]*\{[^}]*\}\}
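A quick way to check this in Python is sketched below, using raw strings for both the text and the patterns. The variable names are only for illustration.

```python
import re

line = r"\newcommand\shortpage[1][1]{\enlargethispage*{-#1\baselineskip}} % see Latex Companion, 2nd ed., p. 234"

# Original pattern: [.*?] is a character class (one of '.', '*', '?'),
# so it can never match the literal "[1]" in the text.
unescaped = r"\\newcommand\\shortpage[.*?][.*?]{.*?{.*?}}"

# Escaped and tightened pattern from this answer.
escaped = r"\\newcommand\\shortpage\[[^\]]*\]\[[^\]]*\]\{[^}]*\{[^}]*\}\}"

no_match = re.search(unescaped, line)
match = re.search(escaped, line)

print(no_match)        # None
print(match.group(0))  # the command without the trailing comment
```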
