Python Regex match parenthesis but not nested parenthesis

Is it possible to match parentheses like () without allowing nesting? In other words, I want my regex to match () but not (())
The regex that I am trying is
\(\[^\(\)])
but it does not seem to be working. Can someone explain to me what I'm doing wrong?

If (foo) in x(foo)x shall be matched, but (foo) in ((foo)) not, what you want is not possible with regular expressions, as regular expressions represent regular grammars and all regular grammars are context free. But context (or 'state', as Jonathon Reinhart called it in his comment) is necessary for the distinction between the (foo) substrings in x(foo)x and ((foo)).
If you only want to match strings that only consist of a parenthesized substring, without any parentheses (matched or unmatched) in that substring, the following regex will do:
^\([^()]*\)$
^ and $ 'glue' the pattern to the beginning and end of the string, respectively, thereby excluding partial matches
note the arbitrary number of repetitions (…*) of the non-parenthesis character inside the parentheses.
note how special characters are not escaped inside a character set, but still have their literal meaning. (Putting backslashes in there would put literal backslashes in the character set. Or in this case out of the character set, due to the negation.)
note how the [ starting the character set isn't escaped, because we actually want its special meaning, rather than its literal meaning
The last two points might be specific to the dialect of regular expressions Python uses.
So this will match () and (foo) completely, but not (not even partially) (foo)bar), (foo(bar), x(foo), (foo)x or ()().
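The examples above can be checked directly. This is only a minimal sketch; the helper name is_flat_parenthesized is made up for illustration and is not part of the answer.

```python
import re

# Anchored pattern from the answer: "(", any run of non-parenthesis characters, ")".
FLAT_PARENS = re.compile(r'^\([^()]*\)$')

def is_flat_parenthesized(text):
    """Return True if text is exactly one parenthesized group with no inner parentheses."""
    return FLAT_PARENS.match(text) is not None

samples = ['()', '(foo)', '(foo)bar)', '(foo(bar)', 'x(foo)', '(foo)x', '()()', '((foo))']
for sample in samples:
    print(sample, is_flat_parenthesized(sample))
```

Only the first two samples should print True. In Python 3.4 and later, re.fullmatch with the unanchored pattern \([^()]*\) is an alternative to writing ^ and $ by hand.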

Related

How does the regex "\" character and grouping "()" character work together?

I am trying to see which statements the following pattern matches:
\(*[0-9]{3}\)*-*[0-9]{3}\d\d\d+
I am a little confused because the grouping characters () have a \ before it. Does this mean that the statement must have a ( and )? Would that mean the statements without ( or ) be unmatched?
Statements:
'4046782347'
'(123)1247890'
'456900900'
'(678)2001236'
'4041231234'
'(404123123'

Context is important:
re.match(r'\(', content) matches a literal parenthesis.
re.match(r'\(*', content) matches 0 or more literal parentheses, thus making the parens optional (and allowing more than one of them, but that's clearly a bug).
Since the intended behavior isn't "0 or more" but rather "0 or 1", this should probably be written r'\(?' instead.
That said, there's a whole lot about this regex that's silly. I'd consider instead:
[(]?\d{3}[)]?-?\d{6,}
Using [(]? avoids backslashes, and consequently is easier to read whether it's rendered by str() or repr() (which escapes backslashes).
Mixing [0-9] and \d is silly; better to pick one and stick with it.
Using * in place of ? is silly, unless you really want to match (((123))-----4567890.
\d{3}\d\d\d+ matches three digits, then three or more additional digits. Why not just match six or more digits in the first place?
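To see the difference in practice, here is a rough comparison of the original pattern and the suggested one against the statements from the question. The constant names and the extra test string with repeated parentheses and dashes are my own additions for illustration.

```python
import re

# Pattern from the question and the simplified pattern suggested in the answer.
ORIGINAL = re.compile(r'\(*[0-9]{3}\)*-*[0-9]{3}\d\d\d+')
SUGGESTED = re.compile(r'[(]?\d{3}[)]?-?\d{6,}')

statements = [
    '4046782347',
    '(123)1247890',
    '456900900',
    '(678)2001236',
    '4041231234',
    '(404123123',
]

# re.match anchors at the start of the string only, not at the end.
for s in statements:
    print(s, bool(ORIGINAL.match(s)), bool(SUGGESTED.match(s)))

# Illustrative input (not from the question) that only the "*" version accepts.
odd = '(((123))-----4567890'
print(odd, bool(ORIGINAL.match(odd)), bool(SUGGESTED.match(odd)))
```

Both patterns accept all six statements, including '(404123123' with its unbalanced parenthesis, because neither pattern requires the opening and closing parentheses to appear together.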

Normally, the parentheses would act as grouping characters, however regex metacharacters are reduced simply to the raw characters when preceded by a backslash. From the Python docs:
As in Python string literals, the backslash can be followed by various characters to signal various special sequences. It’s also used to escape all the metacharacters so you can still match them in patterns; for example, if you need to match a [ or \, you can precede them with a backslash to remove their special meaning: \[ or \\.
In your case, the statements don't need parentheses in order to match, as each \( and \) in the expression is followed by a *, which means that the previous character can be matched any number of times, including none at all. From the Python docs:
* doesn’t match the literal character *; instead, it specifies that the previous character can be matched zero or more times, instead of exactly once.
Thus the statements with or without parentheses around the first 3 digits may match.
Source: https://docs.python.org/2/howto/regex.html

What is the use of the following statement in python regular expression?

I am new to Python and I need to work on an existing Python script. Can someone explain to me the meaning of the following statement?
pgre = re.compile("([^T]+)T([^\.]+)\.[^\s]+\s(\d+\.\d+):\s\[.+\]\s+(\d+)K->(\d+)K\((\d+)K\),\s(\d+\.\d+)\ssecs\]")

You need to consult the references for the exact meanings of each part of that regular expression, but the basic purpose of it is to parse the GC logging. Each parenthesized part of the expression () is a group that matches a useful part of the GC line.
For example, the start of the regex ([^T]+)T matches everything up to the first "T", and the grouped part returns the text before the "T", i.e. the date "2013-08-28"
The content of the group, [^T]+ means "at least one character that is not a T"
Patterns in square brackets [] are character classes - consult the references in the comments above for details. Note that your input text contains literal square brackets, so the pattern handles those with the \[ escape sequence - see below.
I think you can simplify ([^T]+)T to (.+?)T, incidentally. (A greedy (.+)T would first run to the last "T" on the line and only then backtrack.)
Other useful sub-patterns:
\s matches whitespace
\d matches numeric digits
\. \( and \[ match literal periods, parentheses, and square brackets, respectively, rather than interpreting them as special regex characters
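As a rough illustration of how the groups come out, the sketch below runs the pattern on a sample line. The sample line is invented for this example (the question does not show its input), and the variable names are only my guesses at what each group represents.

```python
import re

# Same pattern as in the question, written as a raw string.
pgre = re.compile(
    r"([^T]+)T([^\.]+)\.[^\s]+\s(\d+\.\d+):\s\[.+\]\s+"
    r"(\d+)K->(\d+)K\((\d+)K\),\s(\d+\.\d+)\ssecs\]"
)

# Hypothetical GC log line, made up to fit the pattern.
sample = ("2013-08-28T12:00:00.123+0000 1.234: "
          "[GC [PSYoungGen: 100K->10K(200K)] 300K->210K(400K), 0.0012345 secs]")

match = pgre.match(sample)
if match:
    # One value per parenthesized group, in order of the opening parentheses.
    date, clock, uptime, before_k, after_k, total_k, pause_secs = match.groups()
    print(date, clock, uptime, before_k, after_k, total_k, pause_secs)
```

Each pair of parentheses in the pattern contributes one entry to match.groups(), so the first entry here is the text before the first "T".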

Regular expression sub

I have a question about regular expression sub in python. So, I have some lines of code and what I want is to replace all floating point values eg: 2.0f,-1.0f...etc..to doubles 2.0,-1.0. I came up with this regular expression '[-+]?[0-9]*\.?[0-9]+f' and it finds what I need but I am not sure how to replace it?
so here's what I have:
# check if floating point value exists
if re.findall('[-+]?[0-9]*\.?[0-9]+f', line):
line = re.sub('[-+]?[0-9]*\.?[0-9]+f', ????? ,line)
I am not sure what to put under ????? such that it will replace what I found in '[-+]?[0-9]*\.?[0-9]+f' without the char f in the end of the string.
Also there might be more than one floating point values, which is why I used re.findall
Any help would be great. Thanks

Capture the part of the text you want to save in a capturing group and use the \1 substitution operator:
line = re.sub(r'([-+]?[0-9]*\.?[0-9]+)f', r'\1' ,line)
Note that findall (or any kind of searching) is unnecessary since re.sub will look for the pattern itself and return the string unchanged if there are no matches.
Now, for several regular expression writing tips:
Always use raw strings (r'...') for regular expressions and substitution strings, otherwise you will need to double your backslashes to escape them from Python's string parser. It is only by accident that you didn't need to do this for \., since . is not part of an escape sequence in Python strings.
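Use \d instead of [0-9] to match a digit. They are equivalent for ASCII input (in Python 3, \d in a str pattern also matches other Unicode decimal digits unless re.ASCII is set), but \d is easier to recognize for "digit", while [0-9] needs to be visually verified.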
Your regular expression will not recognize 10.f, which is likely a valid decimal number in your input. Matching floating-point numbers in various formats is trickier than it seems at first, but simple googling will reveal many reasonably complete solutions for this.
The re.X flag will allow you to add arbitrary whitespace and even comments to your regexp. With small regexps that can seem downright silly, but for large expressions the added clarity is a life-saver. (Your regular expression is close to the threshold.)
Here is an example of an extended regular expression that implements the above style tips:
line = re.sub(r'''
    ( [-+]?
      (?: \d+ (?: \.\d* )?   # 12 or 12. or 12.34
        |
          \.\d+              # .12
      )
    ) f''',
    r'\1', line, flags=re.X)
((?:...) is a non-capturing group, only used for precedence.)
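Put together as something runnable, the substitution might look like the sketch below. The function name strip_float_suffix and the sample line are assumptions for illustration only.

```python
import re

# Verbose-mode pattern: capture the number, leave the trailing "f" outside the group.
FLOAT_SUFFIX = re.compile(r'''
    ( [-+]?
      (?: \d+ (?: \.\d* )?   # 12 or 12. or 12.34
        | \.\d+              # .12
      )
    ) f''', flags=re.X)

def strip_float_suffix(line):
    """Replace each number followed by 'f' with the number alone."""
    return FLOAT_SUFFIX.sub(r'\1', line)

print(strip_float_suffix('float x = 2.0f + -1.0f * 10.f + .5f;'))
```

Note that this also rewrites any digit followed by f inside an identifier (for example v2f would become v2), so real source code may need a stricter pattern.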

This is my go-to reference for all things regex.
http://www.regular-expressions.info/named.html
The result should be something like:
line = re.sub(r'(?P<first>[-+]?[0-9]*\.?[0-9]+)f', r'\g<first>', line)

Surround the part of the regex you want to "keep" in a "capture group", e.g.
'([-+]?[0-9]*\.?[0-9]+)f'
 ^                    ^
And then you can refer to these capture groups using \1 in your substitution:
r'\1'
For future reference, you can have many capture groups, i.e. \2, \3, etc. by order of the opening parentheses.

Can I mix character classes in Python RegEx?

Special sequences (character classes) in Python RegEx are escapes like \w or \d that match a set of characters.
In my case, I need to be able to match all alpha-numerical characters except numbers.
That is, \w minus \d.
I need to use the special sequence \w because I'm dealing with non-ASCII characters and need to match symbols like "Æ" and "Ø".
One would think I could use this expression: [\w^\d] but it doesn't seem to match anything and I'm not sure why.
So in short, how can I mix (add/subtract) special sequences in Python Regular Expressions?
EDIT: I accidentally used [\W^\d] instead of [\w^\d]. The latter does indeed match something, including parentheses and commas which are not alpha-numerical characters as far as I'm concerned.

You can use r"[^\W\d]", i.e. invert the union of non-alphanumerics and numbers.
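A small check of that pattern, assuming Python 3, where \w and \d are Unicode-aware for str patterns by default. The sample text is invented.

```python
import re

# "Not a non-word character and not a digit" leaves letters plus the underscore.
LETTERS = re.compile(r'[^\W\d]+')

print(LETTERS.findall('Æble 123 Øl x9'))
print(LETTERS.findall('a_b'))
```

Because \w includes the underscore, [^\W\d] matches it too; use [^\W\d_] if underscores should be excluded as well.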

You cannot subtract character classes, no.
Your best bet is to use the regex project, which offers additional functionality while remaining backwards compatible with the re module in the standard library. It supports character classes based on Unicode properties:
\p{IsAlphabetic}
This will match any character that the Unicode specification states is an alphabetic character.
Even better, regex does support character class subtraction; it views such classes as sets and allows you to create a difference with the -- operator:
[\w--\d]
matches everything in \w except anything that also matches \d.

You can exclude classes using a negative lookahead assertion, such as r'(?!\d)[\w]' to match a word character, excluding digits. For example:
>>> re.search(r'(?!\d)[\w]', '12bac')
<_sre.SRE_Match object at 0xb7779218>
>>> _.group(0)
'b'
To exclude more than one group, you can use the usual [...] syntax in the lookahead assertion, for example r'(?![0-5])[\w]' would match any alphanumeric character except for digits 0-5.
As with [...], the above construct matches a single character. To match multiple characters, add a repetition operator:
>>> re.search(r'((?!\d)[\w])+', '12bac15')
<_sre.SRE_Match object at 0x7f44cd2588a0>
>>> _.group(0)
'bac'
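The same lookahead idea as a script rather than an interpreter session; a sketch only, using a non-capturing group for the repetition and names chosen for illustration.

```python
import re

# Each repetition checks "next character is not a digit", then consumes one word character.
WORD_NOT_DIGIT = re.compile(r'(?:(?!\d)\w)+')

match = WORD_NOT_DIGIT.search('12bac15')
print(match.group(0))

# Excluding only the digits 0-5 still lets 6-9 through.
print(re.findall(r'(?![0-5])\w', 'a0b5c6'))
```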

I don't think you can directly combine (boolean and) character sets in a single regex, whether one is negated or not. Otherwise you could simply have combined [^\d] and \w.
Note: the ^ has to be at the start of the set, and applies to the whole set. From the docs: "If the first character of the set is '^', all the characters that are not in the set will be matched.".
Your set [\w^\d] matches a single character that is a word character, a literal caret, or a digit; because the ^ is not the first character of the set, it does not negate anything.
I would do it in two steps, effectively combining the regular expressions. First match by non-digits (inner regex), then match by alpha-numerical characters:
re.search(r'\w+', re.search(r'([^\d]+)', s).group(0)).group(0)
or variations on this theme.
Note that you would need to surround this with a try: except: block, as it will throw an AttributeError: 'NoneType' object has no attribute 'group' in case one of the two regexes fails. But you can, of course, split this single line up into a few more lines.
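Split over a few lines, the two-step idea could look like the following sketch, which checks for None instead of catching AttributeError. The function name is hypothetical.

```python
import re

def first_word_without_digits(s):
    """Two-step search: first a run of non-digits, then a run of word characters inside it."""
    outer = re.search(r'[^\d]+', s)
    if outer is None:
        return None          # the string contains only digits (or is empty)
    inner = re.search(r'\w+', outer.group(0))
    if inner is None:
        return None          # the non-digit run has no word characters
    return inner.group(0)

print(first_word_without_digits('123abc 45'))
print(first_word_without_digits('123'))
```

This only looks inside the first run of non-digit characters, so it is not a general replacement for a single combined character class.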

python "re" package, strange phenomenon with "raw" string

I am seeing the following phenomenon, couldn't seem to figure it out, and didn't find anything with some search through archives:
if I type in:
>>> if re.search(r'\n', r'this\nis\nit'):
...     print 'found it!'
... else:
...     print "didn't find it"
...
I will get:
didn't find it
However, if I type in:
>>> if re.search(r'\\n', r'this\nis\nit'):
...     print 'found it!'
... else:
...     print "didn't find it"
...
Then I will get:
found it!
(The first one only has one backslash on the r'\n' whereas the second one has two backslashes in a row on the r'\\n' ... even this interpreter is removing one of them.)
I can guess what is going on, but I don't understand the official mechanism as to why this is happening: in the first case, I need to escape two things: both the regular expression and the special strings. "Raw" lets me escape the special strings, but not the regular expression.
But there will never be a regular expression in the second string, since it is the string being matched. So there is only a need to escape once.
However, something doesn't seem consistent to me: how am I supposed to ensure that the characters REALLY ARE taken literally in the first case? Can I type rr'' ? Or do I have to ensure that I escape things twice?
In a similar vein, how do I ensure that a variable is taken literally (or that it is NOT taken literally)? E.g., what if I had a variable tmp = 'this\nis\nmy\nhome', and I really wanted to find the literal combination of a backslash and an 'n', instead of a newline?
Thanks! Mike

re.search(r'\n', r'this\nis\nit')
As you said, "there will never be a regular expression in the second string." So we need to look at these strings differently: the first string is a regex, the second just a string. Usually your second string will not be raw, so any backslashes are Python-escapes, not regex-escapes.
So the first string consists of a literal "\" and an "n". This is interpreted by the regex parser as a newline (docs: "Most of the standard escapes supported by Python string literals are also accepted by the regular expression parser"). So your regex will be searching for a newline character.
Your second string consists of the string "this" followed by a literal "\" and an "n". So this string does not contain an actual newline character. Your regex will not match.
As for your second regex:
re.search(r'\\n', r'this\nis\nit')
This version matches because your regex contains three characters: a literal "\", another literal "\" and an "n". The regex parser interprets the two backslashes as a single "\" character, followed by an "n". So your regex will be searching for a "\" followed by an "n", which is found within the string. But that isn't very helpful, since it has nothing to do with newlines.
Most likely what you want is to drop the r from the second string, thus treating it as a normal Python string.
re.search(r'\n', 'this\nis\nit')
In this case, your regex (as before) is searching for a newline character. And, it finds it, because the second string contains the word "this" followed by a newline.
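The three cases discussed above can be run side by side. This is a sketch for Python 3; the variable names are mine, and the last part uses re.escape as one possible way to search for text stored in a variable literally.

```python
import re

raw_text = r'this\nis\nit'    # contains backslash + 'n', no newline characters
plain_text = 'this\nis\nit'   # contains real newline characters

# Regex \n means "newline": absent from raw_text, present in plain_text.
newline_in_raw = re.search(r'\n', raw_text)
newline_in_plain = re.search(r'\n', plain_text)

# Regex \\n means "backslash followed by n": present in raw_text.
backslash_n_in_raw = re.search(r'\\n', raw_text)

print(newline_in_raw, newline_in_plain, backslash_n_in_raw)

# To search for the contents of a variable literally, escape it first.
needle = '\\n'                # two characters: backslash and 'n'
literal_hit = re.search(re.escape(needle), raw_text)
print(literal_hit is not None)
```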

Escaping special sequences in string literals is one thing, escaping regular expression special characters is another. The raw string prefix only affects the former.
Technically, re.search accepts two strings and passes the first to the regex builder with re.compile. The compiled regex object is used to search patterns inside simple strings. The second string is never compiled and thus it is not subject to regex special character rules.
If the regex builder receives a \n after the string literal is processed, it converts this sequence to a newline character. You also have to escape it if you need to match the sequence itself instead.
All rationale behind this is that regular expressions are not part of the language syntax. They are rather handled within the standard library inside the re module with common building blocks of the language.
The re.compile function uses special characters and escaping rules compatible with most commonly used regex implementations. However, the Python interpreter is not aware of the whole regular expression concept and it does not know whether a string literal will be compiled into a regex object or not. As a result, Python can't provide any kind of syntax simplification such as the ones you suggested.

Regexes have their own meaning for backslashes, such as in character classes like \d. If you actually want a literal backslash character, you will in fact need to double-escape it. It's really not supposed to be parallel since you're comparing a regex to a string.
Raw strings are just a convenience, and it would be way overkill to have double-raw strings.
