Python regex to find all cases of curly brackets, inclusive of brackets - python

I would like to find and replace strings within double curly brackets, inclusive of the brackets themselves.
For example:
{{hello}}
should ideally return:
{{hello}}
I found this expression: {{(.*?)}} here
However, this only returns the text between the brackets, i.e. with this expression the example above would return "hello" rather than "{{hello}}".

I'm assuming you're using re.findall() which only returns the contents of capturing groups, if they are present in the regex.
Since you don't want them, just remove the capturing parentheses:
matches = re.findall("{{.*?}}", mystring)
When using re.sub() for replacing, the original regex will work just fine.
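As a minimal sketch of the difference described above (the sample string and the replacement text "X" are made up for illustration):

```python
import re

mystring = "Say {{hello}} to {{world}}."  # hypothetical sample input

# With a capturing group, findall returns only the group's contents.
inner = re.findall(r"{{(.*?)}}", mystring)

# Without the group, findall returns each whole match, brackets included.
whole = re.findall(r"{{.*?}}", mystring)

# re.sub replaces the whole match, so the original pattern works there.
replaced = re.sub(r"{{(.*?)}}", "X", mystring)

print(inner)     # ['hello', 'world']
print(whole)     # ['{{hello}}', '{{world}}']
print(replaced)  # Say X to X.
```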

You can just move the ( and ) to include the brackets within the group.
Try: ({{.*?}}). I highly recommend a regex tester for this kind of work.

Related

Finding pattern in Python

I have a file. I want to create a pattern. Then, I will calculate some values. I am using the code below. How can I split these 3 values? Also, there is whitespace in some columns.
title=movies[3].split(",")[1]
title
pattern=r"(\d+)[(\d+)-(\d+)]"
import re
re.findall(pattern,title)
You forgot to escape the characters [ and ]. Unescaped, they are interpreted as a character class, i.e. a set of characters defined between those brackets.
Change your regular expression to:
pattern = r"(\d+)\[(\d+)-(\d+)\]"
To allow optional whitespace in the regular expression you can use \s*. So the full regular expression would be
pattern = r"(\d+)\[\s*(\d+)-(\d+)\s*\]"
A - defines a range only inside a character class (between unescaped [ and ]). Since your pattern places it between square brackets, escape it using \ so that the expression looks for the literal character instead.
This means your new pattern should look like: (\d+)[(\d+)\-(\d+)]
As a side-note, I can recommend using regex101 to double-check your patterns before using them!
Adding to Matthias' answer: if the square brackets are part of the title, you will also want to escape them, like so: (\d+)\[(\d+)\-(\d+)\]
This will look for two values separated by a hyphen, within square brackets.
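A rough sketch of the escaped pattern in use. The sample titles here are invented, since the original file is not shown, and converting the groups to integers is only an assumption about what "calculate some values" requires:

```python
import re

# Escaped brackets match literal "[" and "]"; \s* tolerates optional spaces.
pattern = re.compile(r"(\d+)\[\s*(\d+)-(\d+)\s*\]")

# Hypothetical sample titles, not taken from the original file.
titles = ["12[3-45]", "7[ 1-2 ]"]

results = [pattern.findall(t) for t in titles]
print(results)  # [[('12', '3', '45')], [('7', '1', '2')]]

# Splitting the three values of the first title into separate integers.
first, low, high = (int(g) for g in pattern.search(titles[0]).groups())
print(first, low, high)  # 12 3 45
```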

Return multiple matches within a string using a regex

I'm using the following regex to clean up a document that has had apostrophes accidentally replaced with double quotes:
([a-zA-Z]\"[a-zA-Z])
That finds the first pattern match within the string, but not any subsequent ones. I've used the '*' operator after the group, which I understood would return multiple matches of that pattern, but this returns none. I've tested the regex here by adding double quotes to the example string.
Does anyone know what the operator I need is for this example?
Thanks
You might need to turn on global matching, which in Python is done by using re.findall() instead of re.search(). On Regexr, the global flag is enabled like this:
[Image: regex flags menu in the top-right corner, http://puu.sh/kgLFC/5958420d09.png]
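In Python, the difference between a single match and "global" matching might look like the sketch below. The sample text is invented, and the re.sub call with two groups is just one possible way to do the actual clean-up, not something taken from the question:

```python
import re

text = 'I don"t think they can"t.'  # hypothetical sample text
pattern = r'[a-zA-Z]"[a-zA-Z]'

# re.search stops at the first match.
first = re.search(pattern, text).group(0)

# re.findall returns every non-overlapping match.
all_matches = re.findall(pattern, text)

# Capture the letters on either side and put an apostrophe between them.
fixed = re.sub(r'([a-zA-Z])"([a-zA-Z])', r"\1'\2", text)

print(first)        # n"t
print(all_matches)  # ['n"t', 'n"t']
print(fixed)        # I don't think they can't.
```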

Python regex - match everything not in mustache brackets

I'm trying to create a regex in Python to match everything not in mustache brackets.
For example, this string:
This is an {{example}} for {{testing}}.
should produce this list of strings:
["This is an ", " for ", "."]
when used with re.findall.
For my mustache-matching regex, I am using this: {{(.*?)}}.
It seems like it should just be a simple matter of negating the above pattern, but I can't get it to work properly. I'm testing using: http://pythex.org
Thanks.
You need re.split:
re.split('{{.*?}}', s)
@Andrey's solution is the simplest and clearly the way to go. There is another possible way with findall, because when the pattern contains a capture group, it returns only the capture group content:
re.findall(r'(?:{{.*?}})*([^{]+)', s)
So, if you start the pattern with an optional non-capturing group for the curly-bracket parts, followed by a capture group for the other content, findall returns only the capture group content, and all curly-bracket parts are consumed.
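Both approaches can be compared on the string from the question. The last line uses a made-up input to show one difference worth knowing: re.split yields an empty string when the text starts with a tag.

```python
import re

s = "This is an {{example}} for {{testing}}."

# Split on every mustache tag; what remains is the text outside the tags.
by_split = re.split(r"{{.*?}}", s)

# Consume tags in a non-capturing group, capture only the other text.
by_findall = re.findall(r"(?:{{.*?}})*([^{]+)", s)

print(by_split)    # ['This is an ', ' for ', '.']
print(by_findall)  # ['This is an ', ' for ', '.']

# Hypothetical input that starts with a tag: split keeps a leading ''.
leading = re.split(r"{{.*?}}", "{{a}}b")
print(leading)     # ['', 'b']
```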

Backreferencing in Python: findall() method output for HTML string

I am trying to learn some regular expressions in Python. The following does not produce the output I expected:
import re
with open('ex06-11.html') as f:
    a = re.findall("<div[^>]*id\\s*=\\s*([\"\'])header\\1[^>]*>(.*?)</div>", f.read())
# output: [('"', 'Some random text')]
The output I was expecting (same code, but without the backreference):
with open('ex06-11.html') as f:
    print(re.findall("<div[^>]*id\\s*=\\s*[\"\']header[\"\'][^>]*>(.*?)</div>", f.read()))
# output: ['Some random text']
The question really boils down to: why is there a quotation mark in my first output, but not in my second? I thought that ([abc]) ... \1 == [abc] ... [abc]. Am I incorrect?
From the docs on re.findall:
If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
If you want the entire match to be returned, remove the capturing groups or change them to non-capturing groups by adding ?: after the opening paren. For example you would change (foo) in your regex to (?:foo).
Of course in this case you need the capturing group for the backreference, so your best bet is to keep your current regex and then use a list comprehension with re.finditer() to get a list of only the second group:
regex = re.compile(r"""<div[^>]*id\s*=\s*(["'])header\1[^>]*>(.*?)</div>""")
with open('ex06-11.html') as f:
    a = [m.group(2) for m in regex.finditer(f.read())]
A couple of side notes: you should really consider using an HTML parser like BeautifulSoup instead of regex. You should also use triple-quoted strings if you need to include single or double quotes within your string, and use raw string literals when writing regular expressions so that you don't need to escape the backslashes.
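The finditer approach can be tried without the file by using an inline string. The HTML snippets below are invented for illustration; the last one shows that the backreference rejects mismatched quotes.

```python
import re

regex = re.compile(r"""<div[^>]*id\s*=\s*(["'])header\1[^>]*>(.*?)</div>""")

# Hypothetical stand-in for the contents of ex06-11.html.
html = '<div id="header">Some random text</div><div id=\'header\'>More</div>'

# findall returns a tuple of both groups for each match.
pairs = regex.findall(html)

# finditer gives match objects, so only group 2 needs to be kept.
texts = [m.group(2) for m in regex.finditer(html)]

# Opening and closing quotes differ, so \1 prevents a match.
mismatched = regex.findall('<div id="header\'>x</div>')

print(pairs)       # [('"', 'Some random text'), ("'", 'More')]
print(texts)       # ['Some random text', 'More']
print(mismatched)  # []
```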
The behaviour is clearly documented. See re.findall:
Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found.
If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.
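So, if you have more than one capture group in your regex pattern, the findall method returns a list of tuples, each containing the captured groups for a particular match; with a single group it returns a list of that group's strings. The whole match, group(0), is not included.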
So, either you use a non-capturing group - (?:[\"\']), or don't use any group at all, as in your 2nd case.
P.S: Use raw string literals for your regex pattern, to avoid escaping your backslashes. Also, compile your regex outside the loop, so that it is not re-compiled on every iteration. Use re.compile for that.
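The documented behaviour quoted above can be checked with a small sketch. The sample text and patterns are made up; only the number of capturing groups changes between the three calls.

```python
import re

text = "1-2 3-4"  # hypothetical sample input

# No groups: a list of whole matches.
no_groups = re.findall(r"\d+-\d+", text)

# One group: a list of that group's strings.
one_group = re.findall(r"(\d+)-\d+", text)

# Two groups: a list of tuples, one tuple per match.
two_groups = re.findall(r"(\d+)-(\d+)", text)

# A non-capturing group does not count as a group.
non_capturing = re.findall(r"(?:\d+)-\d+", text)

print(no_groups)      # ['1-2', '3-4']
print(one_group)      # ['1', '3']
print(two_groups)     # [('1', '2'), ('3', '4')]
print(non_capturing)  # ['1-2', '3-4']
```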
When I asked this question I was just starting with regular expressions. I have since read the docs completely, and I just wanted to share what I found out.
Firstly, as Rohit and F.J suggested, use raw strings (to make the regex more readable and less error-prone) and compile your regex beforehand using re.compile. To match an HTML string whose id is 'header':
s = "<div id='header'>Some random text</div>"
We would need a regex like:
p = re.compile(r'<div[^>]*id\s*=\s*([\"\'])header\1[^>]*>(.*?)</div>')
In the Python implementation of regex, a capturing group is made by enclosing part of your regex in parentheses (...). Capturing groups capture the span of text that they match. They are also needed for backreferencing. So in my regex above, I have two capturing groups: ([\"\']) and (.*?). The first one is needed to make the backreference \1 possible. The use of backreferences (and the fact that they refer back to a capturing group) has consequences, however. As pointed out in the other answers to this question, when using findall on my pattern p, findall will return matches from all groups and put them in a list of tuples:
print(p.findall(s))
# [("'", 'Some random text')]
Since we only want the plain text from between the HTML tags, this is not the output we're looking for.
(Arguably, we could use:
print(p.findall(s)[0][1])
# Some random text
But this may be a bit contrived.)
So in order to return only the text from between the HTML tags (captured by the second group), we use the group() method on p.search():
print(p.search(s).group(2))
# Some random text
# Some random text
I'm fully aware that all but the most simple HTML should not be handled by regex, and that instead you should use a parser. But this was just a tutorial example for me to grasp the basics of regex in Python.
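One caveat with p.search(s).group(2) is that search returns None when nothing matches, so the chained call would raise an AttributeError. A small guarded sketch, where the helper name header_text and the second input are hypothetical:

```python
import re

p = re.compile(r'<div[^>]*id\s*=\s*(["\'])header\1[^>]*>(.*?)</div>')

def header_text(html):
    """Return the text of the first matching div, or None if there is none."""
    m = p.search(html)
    return m.group(2) if m else None

found = header_text("<div id='header'>Some random text</div>")
missing = header_text("<p>no header here</p>")

print(found)    # Some random text
print(missing)  # None
```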

Python regex - Ignore parenthesis as indexing?

I've currently written a nooby regex pattern which involves excessive use of the "(" and ")" characters, but I'm using them for 'or' operators, such as (A|B|C) meaning A or B or C.
I need to find every match of the pattern in a string.
Trying to use the re.findall(pattern, text) method is no good, since it interprets the parentheses as indexing signifiers (or whatever the correct jargon is), and so each element of the produced list is not a string showing the matched text, but instead a tuple (containing very ugly snippets of the pattern match).
Is there an argument I can pass to findall to ignore parentheses as indexing?
Or will I have to use a very ugly combination of re.search and re.sub?
(This is the only solution I can think of; Find the index of the re.search, add the matched section of text to the List then remove it from the original string {by using ugly index tricks}, continuing this until there's no more matches. Obviously, this is horrible and undesirable).
Thanks!
Yes, add ?: to a group to make it non-capturing.
import re
print(re.findall('(.(foo))', "Xfoo")) # [('Xfoo', 'foo')]
print(re.findall('(.(?:foo))', "Xfoo")) # ['Xfoo']
See re syntax for more information.
re.findall(r"(?:A|B|C)D", "BDE")
or
re.findall(r"((?:A|B|C)D)", "BDE")
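To see what these alternatives return, here is a short sketch on a slightly longer, made-up input. The finditer variant at the end is an additional option for cases where the capturing groups must stay in the pattern:

```python
import re

text = "BDE AD"  # hypothetical sample input

# Capturing alternation: findall returns only the group's content.
capturing = re.findall(r"(A|B|C)D", text)

# Non-capturing alternation: findall returns the whole match.
non_capturing = re.findall(r"(?:A|B|C)D", text)

# One outer capturing group around everything gives the same strings.
outer_group = re.findall(r"((?:A|B|C)D)", text)

# finditer exposes match objects; group(0) is always the whole match.
via_finditer = [m.group(0) for m in re.finditer(r"(A|B|C)D", text)]

print(capturing)      # ['B', 'A']
print(non_capturing)  # ['BD', 'AD']
print(outer_group)    # ['BD', 'AD']
print(via_finditer)   # ['BD', 'AD']
```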
