How do the regex "\" character and the grouping "()" characters work together?


I am trying to see which statements the following pattern matches:
\(*[0-9]{3}\)*-*[0-9]{3}\d\d\d+
I am a little confused because the grouping characters () each have a \ before them. Does this mean that the statement must contain a ( and a )? Would the statements without ( or ) then go unmatched?
Statements:
'4046782347'
'(123)1247890'
'456900900'
'(678)2001236'
'4041231234'
'(404123123'

Context is important:
re.match(r'\(', content) matches a literal parenthesis.
re.match(r'\(*', content) matches 0 or more literal parentheses, thus making the parens optional (and allowing more than one of them, but that's clearly a bug).
Since the intended behavior isn't "0 or more" but rather "0 or 1", this should probably be written r'\(?' instead.
That said, there's a whole lot about this regex that's silly. I'd consider instead:
[(]?\d{3}[)]?-?\d{6,}
Using [(]? avoids backslashes, and consequently is easier to read whether it's rendered by str() or repr() (which escapes backslashes).
Mixing [0-9] and \d is silly; better to pick one and stick with it.
Using * in place of ? is silly, unless you really want to match (((123))-----4567890.
\d{3}\d\d\d+ matches three digits, then three or more additional digits. Why not just match six or more digits in the first place?
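To make the difference concrete, here is a minimal sketch that runs both patterns over a deliberately messy input. The names `ORIGINAL` and `TIGHTER` and the third sample string are illustrative assumptions, not part of the question.

```python
import re

# Pattern from the question and the tightened alternative suggested above.
ORIGINAL = re.compile(r'\(*[0-9]{3}\)*-*[0-9]{3}\d\d\d+')
TIGHTER = re.compile(r'[(]?\d{3}[)]?-?\d{6,}')

# The last sample is made up to exercise the "zero or more" quantifiers.
samples = ['(123)1247890', '4046782347', '(((123))-----4567890']

# For each sample: (does ORIGINAL match fully?, does TIGHTER match fully?)
results = {
    s: (bool(ORIGINAL.fullmatch(s)), bool(TIGHTER.fullmatch(s)))
    for s in samples
}

for s, (orig_ok, tight_ok) in results.items():
    print(f'{s!r}: original={orig_ok}, tighter={tight_ok}')
```

Both patterns accept the two well-formed inputs; only the original accepts the one with repeated parentheses and hyphens.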

Normally, the parentheses would act as grouping characters; however, regex metacharacters are reduced to their literal characters when preceded by a backslash. From the Python docs:
As in Python string literals, the backslash can be followed by various characters to signal various special sequences. It’s also used to escape all the metacharacters so you can still match them in patterns; for example, if you need to match a [ or \, you can precede them with a backslash to remove their special meaning: \[ or \\.
In your case, the statements don't need parentheses in order to match, as each \( and \) in the expression is followed by a *, which means that the previous character can be matched any number of times, including none at all. From the Python docs:
* doesn’t match the literal character *; instead, it specifies that the previous character can be matched zero or more times, instead of exactly once.
Thus the statements with or without parentheses around the first 3 digits may match.
Source: https://docs.python.org/2/howto/regex.html
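As a quick check, the six statements from the question can be run through the pattern with re.match. This is only a sketch; the variable names are made up for illustration.

```python
import re

pattern = re.compile(r'\(*[0-9]{3}\)*-*[0-9]{3}\d\d\d+')

statements = [
    '4046782347', '(123)1247890', '456900900',
    '(678)2001236', '4041231234', '(404123123',
]

# Keep every statement the pattern matches from the start of the string.
matched = [s for s in statements if pattern.match(s)]
print(matched)

# An escaped parenthesis is a literal character, not a group:
literal = re.compile(r'\(\d\)')   # matches text such as "(5)"
grouped = re.compile(r'(\d)')     # matches one digit and captures it
print(literal.groups, grouped.groups)
```

All six statements match, including '(404123123', because \)* is allowed to match zero closing parentheses.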

Related

Python Regex match parenthesis but not nested parenthesis

Is it possible to match parenthesis like () but not allowing nesting? In other words, I want my regex to match () but not (())
The regex that I am trying is
\(\[^\(\)])
but it does not seem to be working. Can someone explain to me what I'm doing wrong?

If (foo) in x(foo)x shall be matched, but (foo) in ((foo)) not, what you want is not possible with regular expressions, as regular expressions represent regular grammars and all regular grammars are context free. But context (or 'state', as Jonathon Reinhart called it in his comment) is necessary for the distinction between the (foo) substrings in x(foo)x and ((foo)).
If you only want to match strings that only consist of a parenthesized substring, without any parentheses (matched or unmatched) in that substring, the following regex will do:
^\([^()]*\)$
^ and $ 'glue' the pattern to the beginning and end of the string, respectively, thereby excluding partial matches
note the arbitrary number of repetitions (…*) of the non-parenthesis character inside the parentheses.
note how special characters are not escaped inside a character set, but still have their literal meaning. (In Python, backslash-escaping them there, as in [^\(\)], is also accepted and means the same thing; the backslashes are simply redundant.)
note how the [ starting the character set isn't escaped, because we actually want its special meaning, rather than its literal meaning
The last two points might be specific to the dialect of regular expressions Python uses.
So this will match () and (foo) completely, but not (not even partially) (foo)bar), (foo(bar), x(foo), (foo)x or ()().
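The claims above can be checked with a short sketch. The name `flat_parens` and the two lists are assumptions introduced here for illustration.

```python
import re

# Anchored pattern from the answer: one parenthesized run with no inner parens.
flat_parens = re.compile(r'^\([^()]*\)$')

should_match = ['()', '(foo)']
should_fail = ['(foo)bar)', '(foo(bar)', 'x(foo)', '(foo)x', '()()']

# Map each test string to whether the pattern finds a match in it.
matches = {s: bool(flat_parens.search(s)) for s in should_match + should_fail}

for s, ok in matches.items():
    print(f'{s!r}: {ok}')
```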

What is the use of the following statement in python regular expression?

I am new to Python and I need to work on an existing Python script. Can someone explain the meaning of the following statement?
pgre = re.compile("([^T]+)T([^\.]+)\.[^\s]+\s(\d+\.\d+):\s\[.+\]\s+(\d+)K->(\d+)K\((\d+)K\),\s(\d+\.\d+)\ssecs\]")

You need to consult the references for the exact meanings of each part of that regular expression, but the basic purpose of it is to parse the GC logging. Each parenthesized part of the expression () is a group that matches a useful part of the GC line.
For example, the start of the regex ([^T]+)T matches everything up to the first "T", and the grouped part returns the text before the "T", i.e. the date "2013-08-28"
The content of the group, [^T]+ means "at least one character that is not a T"
Patterns in square brackets [] are character classes - consult the references in the comments above for details. Note that your input text contains literal square brackets, so the pattern handles those with the \[ escape sequence - see below.
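Incidentally, I think you can simplify ([^T]+)T to just (.+)T, provided no other T can appear later in the line: .+ is greedy, so it would otherwise run up to the last T that still lets the rest of the pattern match.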
Other useful sub-patterns:
\s matches whitespace
\d matches numeric digits
\. \( and \[ match literal periods, parentheses, and square brackets, respectively, rather than interpreting them as special regex characters
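To see the groups in action, here is a sketch that applies the pattern to one log line. The question does not show its input, so the `sample` line below is a hypothetical entry invented to fit the shape the pattern expects; the pattern itself is unchanged apart from being written as a raw string split over two lines.

```python
import re

pgre = re.compile(
    r"([^T]+)T([^\.]+)\.[^\s]+\s(\d+\.\d+):\s\[.+\]\s+"
    r"(\d+)K->(\d+)K\((\d+)K\),\s(\d+\.\d+)\ssecs\]"
)

# Hypothetical log line, made up for illustration only.
sample = ("2013-08-28T14:20:04.123+0000: 12.345: "
          "[GC [PSYoungGen: 1024K->128K(2048K)] "
          "4096K->3200K(8192K), 0.0123450 secs]")

m = pgre.search(sample)
fields = m.groups() if m else None

# Groups in order: date, time, uptime, the three K figures
# around "->" and in parentheses, and the duration in seconds.
print(fields)
```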

Why '\A' in python regex doesn't work inside [ ]?

I was trying to get a regex which would match a word in the beginning of the line or after certain word. I tried:
r"[\A|my_word](smth)"
But it failed because it doesn't match the \A in that case. What's wrong with that?
It turns out that \A doesn't work inside []:
In [163]: type(re.search(r"\A123", "123"))
Out[163]: <type '_sre.SRE_Match'>
In [164]: type(re.search(r"[\A]123", "123"))
Out[164]: <type 'NoneType'>
But I don't understand why.
I'm using Python 2.6.6
EDIT:
After some comments I realized that the example I used with [\A|my_word] is bad. The actual expression is [\AV] to match either beginning of the string or V. The main problem I had is that I was curious why [\A] doesn't work.

My understanding of backslashes in bracket character classes was off, it seems, but even so, it is the case that [\A|my_word] is equivalent to [A|my_word] and will try to match a single one of A, |, m, y, _, w, o, r, or d before smth.
Here's a regular expression that should do what you want. Unfortunately, a lookbehind can't be used in Python because \A and my_word have different lengths, but a non-capturing group can be used instead: (?:\A|my_word)(smth).
(You can also use ^ instead of \A if you want, though the usage may differ in multiline mode as ^ will also match at the start of each new line [or rather, immediately after every newline] in that mode.)
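A minimal sketch of the non-capturing-group approach, using the placeholder names my_word and smth from the question; the three test strings are assumptions made up for illustration.

```python
import re

# Match "smth" either at the very start of the string or right after "my_word".
pattern = re.compile(r'(?:\A|my_word)(smth)')

at_start = pattern.search('smth else')        # anchored by \A
after_word = pattern.search('xx my_wordsmth') # preceded by my_word
elsewhere = pattern.search('xx smth')         # neither condition holds

print(at_start.group(1), after_word.group(0), elsewhere)
```

Note that the non-capturing group still consumes my_word, so it is part of group(0); only smth lands in group(1).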

Anchors vs Character Classes
\A is an anchor that matches a position in the string - in this case the position before the first char in the string. Other anchors are \b: word boundary, ^: start of string/line, $: end of string/line, (?=...): Positive lookahead, (?!...): negative lookahead, etc. Anchors consume no characters and only match a position within the string.
[abc] is a character class that always matches exactly one character - in this case either a, b or c
Thus, placing an anchor inside a character class makes no sense.
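The distinction shows up in the span of each match. The following sketch uses made-up input strings, and the alternation at the end is one possible way to express "start of string or V"; it is not taken from the question.

```python
import re

# An anchor matches a position: the match is zero characters wide.
anchor = re.search(r'\A', 'V123')
# A character class matches exactly one character.
char_class = re.search(r'[AV]', 'V123')
print(anchor.span(), char_class.span())

# "Start of string or V" therefore needs alternation, not a character class.
either = re.compile(r'(?:\A|V)123')
at_start = either.search('123')
after_v = either.search('xV123')
neither = either.search('x123')
```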

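[\A] is a character class, so it can only ever match a single literal character, never a position. This is probably not what you wanted. (Recent Python 3 releases reject [\A] outright with a "bad escape" error.)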

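Inside a bracket expression, \A loses its special meaning as an anchor.
In POSIX bracket expressions the backslash itself is also literal, so there [\A] stands for two characters: \ and A. Python's re module, by contrast, still treats the backslash as an escape character inside [ ].
Regex references: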
The Single UNIX Specification
Python 2.6 - re module
UPDATE
A bracket expression is a special case in itself, so it is very unlikely that special sequences like \A (which act almost like control commands for the regex engine) would work inside one. It would be somewhat unnatural.
ONE MORE THING
As the Python reference states:
[] Used to indicate a set of characters.
\A is a special sequence which:
Matches only at the start of the string.
It is obviously not a character that could belong to any set. There is a NEWLINE character (\n), but there is no such thing as a STARTLINE character.
Also, for escapists:
You can even put ] into a bracket expression without escaping it, as long as it comes right after the opening [:
The pattern []] will match ']', for example.

Regular expression sub

I have a question about regular expression sub in python. So, I have some lines of code and what I want is to replace all floating point values eg: 2.0f,-1.0f...etc..to doubles 2.0,-1.0. I came up with this regular expression '[-+]?[0-9]*\.?[0-9]+f' and it finds what I need but I am not sure how to replace it?
so here's what I have:
# check if floating point value exists
if re.findall('[-+]?[0-9]*\.?[0-9]+f', line):
    line = re.sub('[-+]?[0-9]*\.?[0-9]+f', ????? ,line)
I am not sure what to put under ????? such that it will replace what I found in '[-+]?[0-9]*\.?[0-9]+f' without the char f in the end of the string.
Also there might be more than one floating point values, which is why I used re.findall
Any help would be great. Thanks

Capture the part of the text you want to save in a capturing group and use the \1 substitution operator:
line = re.sub(r'([-+]?[0-9]*\.?[0-9]+)f', r'\1' ,line)
Note that findall (or any kind of searching) is unnecessary since re.sub will look for the pattern itself and return the string unchanged if there are no matches.
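A minimal sketch of that behaviour, with the substitution wrapped in a helper. The name `strip_float_suffix` and the sample lines are hypothetical, introduced only for this example.

```python
import re

# Same pattern as above, with the numeric part captured as group 1.
FLOAT_F = re.compile(r'([-+]?[0-9]*\.?[0-9]+)f')

def strip_float_suffix(line):
    """Replace every float literal like 2.0f with its number part (2.0)."""
    return FLOAT_F.sub(r'\1', line)

converted = strip_float_suffix('x = 2.0f + -1.0f;')
untouched = strip_float_suffix('int n = 3;')
print(converted)
print(untouched)
```

Every occurrence on the line is replaced in one call, and a line without a match comes back unchanged, so no prior findall check is needed.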
Now, for several regular expression writing tips:
Always use raw strings (r'...') for regular expressions and substitution strings, otherwise you will need to double your backslashes to escape them from Python's string parser. It is only by accident that you didn't need to do this for \., since . is not part of an escape sequence in Python strings.
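Use \d instead of [0-9] to match a digit. They are equivalent for ASCII text (in Python 3, \d in a str pattern also matches other Unicode decimal digits unless re.ASCII is set), but \d is easier to recognize as "digit", while [0-9] needs to be visually verified.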
Your regular expression will not recognize 10.f, which is likely a valid decimal number in your input. Matching floating-point numbers in various formats is trickier than it seems at first, but simple googling will reveal many reasonably complete solutions for this.
The re.X flag will allow you to add arbitrary whitespace and even comments to your regexp. With small regexps that can seem downright silly, but for large expressions the added clarity is a life-saver. (Your regular expression is close to the threshold.)
Here is an example of an extended regular expression that implements the above style tips:
line = re.sub(r'''
    ( [-+]?
      (?: \d+ (?: \.\d* )?   # 12 or 12. or 12.34
        |
          \.\d+              # .12
      )
    ) f''',
    r'\1', line, flags=re.X)
((?:...) is a non-capturing group, only used for precedence.)

This is my go-to reference for all things regex.
http://www.regular-expressions.info/named.html
In Python, a named group is written (?P<name>...), so the result should be something like:
line = re.sub(r'(?P<first>[-+]?[0-9]*\.?[0-9]+)f', r'\g<first>', line)

Surround the part of the regex you want to "keep" in a "capture group", e.g.
'([-+]?[0-9]*\.?[0-9]+)f'
^ ^
And then you can refer to these capture groups using \1 in your substitution:
r'\1'
For future reference, you can have many capture groups, i.e. \2, \3, etc. by order of the opening parentheses.
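A short sketch of several groups in one substitution, plus the named-group form. The input strings and the group name `number` are assumptions made up for illustration.

```python
import re

# Two numbered groups: swap the key and the value around the "=" sign.
swapped = re.sub(r'(\w+)=(\d+)', r'\2=\1', 'width=10 height=20')
print(swapped)

# The same idea with a named group, referenced as \g<name> in the replacement.
named = re.sub(r'(?P<number>[-+]?[0-9]*\.?[0-9]+)f', r'\g<number>', 'y = .5f')
print(named)
```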

python regular expresssion for a string

consider this string
prison break: proof of innocence (2006) {abduction (#1.10)}
i just want to know whether there is (# floating point value )} in the string or not
i tried few regular expressions like
re.search('\(\#+\f+\)\}',xyz)
and
re.search('\(\#+(\d\.\d)+\)\}',xyz)
nothing worked though...can someone suggest me something here

Try r'\(#\d+\.\d+\)\}'
The (, ), ., and } are all special metacharacters, that's why they're preceded by \, so they're matched literally instead.
You also need to apply the + repetition at the right element. Here it's attached to the \d -- the shorthand for digit character class -- to mean that only the digits can appear one-or-more times.
The use of r'raw string literals' makes it easier to work with regex patterns because you don't have to escape backslashes excessively.
See also
What exactly do u and r string flags in Python do, and what are raw string literals?
Variations
For instructional purposes, let's consider a few variations. This will show a few basic features of regex. Let's first consider one of the attempted patterns:
\(\#+(\d\.\d)+\)\}
Let's space out the parts for readability:
\( \#+ ( \d \. \d )+ \) \}
       \__________/
       this is one group, repeated with +
So this pattern matches:
A literal (, followed by one-or-more #
Followed by one-or-more of:
A digit, a literal dot, and a digit
Followed by a literal )}
Thus, the pattern will match e.g. (###1.23.45.6)} (as seen on rubular.com). Obviously this is not the pattern we want.
Now let's try to modify the solution pattern and say that perhaps we also want to allow just a sequence of digits, without the subsequent period and following digits. We can do this by grouping that part (…), and making it optional with ?.
BEFORE
\(#\d+\.\d+\)\}
      \___/
      let's make this optional! (…)?
AFTER
\(#\d+(\.\d+)?\)\}
Now the pattern matches e.g. (#1.23)} as well as e.g. (#666)} (as seen on rubular.com).
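The three patterns discussed here can be compared side by side. This is only a sketch; the variable names are illustrative, and `title` is the string from the question.

```python
import re

title = 'prison break: proof of innocence (2006) {abduction (#1.10)}'

attempt = re.compile(r'\(\#+(\d\.\d)+\)\}')    # pattern tried in the question
solution = re.compile(r'\(#\d+\.\d+\)\}')      # suggested pattern
optional = re.compile(r'\(#\d+(\.\d+)?\)\}')   # fractional part made optional

# The attempt needs digit-dot-digit units, so "1.10" does not fit...
attempt_on_title = attempt.search(title)
# ...but repeated hashes and chained units do.
attempt_on_odd = attempt.search('(###1.23.45.6)}')

found = solution.search(title).group()
whole_number = optional.search('(#666)}').group()
strict_on_whole = solution.search('(#666)}')

print(attempt_on_title, found, whole_number, strict_on_whole)
```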
References
regular-expressions.info - Optional, Brackets for Grouping

"Escape everything" and use raw-literal syntax for safety:
>>> import re
>>> s='prison break: proof of innocence (2006) {abduction (#1.10)}'
>>> re.search(r'\(\#\d+\.\d+\)\}', s)
<_sre.SRE_Match object at 0xec950>
>>> _.group()
'(#1.10)}'
>>>
This assumes that by "floating point value" you mean "one or more digits, a dot, one or more digits", and is not tolerant of other floating point syntax variations, multiple hashes (which you appear from your RE patterns to want to support but don't mention in your Q's text), arbitrary whitespace among the relevant parts (again, unclear from your Q whether you need it), ... -- some issues can be adjusted pretty easily, others "not so much" (it's particularly hard to guess what gamut of FP syntax variations you want to support, for example).
