I am new to Python and I need to work on an existing Python script. Can someone explain the meaning of the following statement?
pgre = re.compile("([^T]+)T([^\.]+)\.[^\s]+\s(\d+\.\d+):\s\[.+\]\s+(\d+)K->(\d+)K\((\d+)K\),\s(\d+\.\d+)\ssecs\]")
You need to consult the references for the exact meanings of each part of that regular expression, but the basic purpose of it is to parse the GC logging. Each parenthesized part of the expression () is a group that matches a useful part of the GC line.
For example, the start of the regex ([^T]+)T matches everything up to the first "T", and the grouped part returns the text before the "T", i.e. the date "2013-08-28"
The content of the group, [^T]+ means "at least one character that is not a T"
Patterns in square brackets [] are character classes - consult the references in the comments above for details. Note that your input text contains literal square brackets, so the pattern handles those with the \[ escape sequence - see below.
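I think you can simplify ([^T]+)T to (.+?)T, incidentally. The lazy quantifier matters: a greedy (.+)T would run on to the last "T" that still lets the rest of the pattern match.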
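One caveat when swapping the negated class for a dot: a greedy .+ and a lazy .+? do not behave the same way in isolation. A minimal sketch, assuming a made-up timestamp string (not taken from the real log) that contains more than one "T":

```python
import re

# Hypothetical input with two "T" characters (invented for illustration).
stamp = "2013-08-28T10:00:00 GMT"

# Negated class: stops at the first "T" by construction.
negated = re.match(r"([^T]+)T", stamp).group(1)
# Greedy dot: grabs as much as possible, so it stops at the LAST "T".
greedy = re.match(r"(.+)T", stamp).group(1)
# Lazy dot: grabs as little as possible, so it stops at the first "T".
lazy = re.match(r"(.+?)T", stamp).group(1)

print(negated)
print(greedy)
print(lazy)
```

Inside the full pattern the remaining parts may force the engine back to the first "T" anyway, but the negated class or the lazy form states the intent directly.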
Other useful sub-patterns:
\s matches whitespace
\d matches numeric digits
\. \( and \[ match literal periods, parentheses, and square brackets, respectively, rather than interpreting them as special regex characters
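To see the groups in action, here is a small sketch that runs the pattern against a GC log line. The sample line is invented for illustration (only the date comes from the answer above), so treat the exact log layout as an assumption:

```python
import re

# Same pattern as in the question, written as a raw string and split in two
# pieces for readability.
pgre = re.compile(
    r"([^T]+)T([^\.]+)\.[^\s]+\s(\d+\.\d+):\s\[.+\]\s+"
    r"(\d+)K->(\d+)K\((\d+)K\),\s(\d+\.\d+)\ssecs\]"
)

# Hypothetical GC log line; the real log format may differ.
sample = ("2013-08-28T14:02:11.123+0200 5.678: "
          "[GC [PSYoungGen: 1000K->200K(3000K)] 4000K->2500K(9000K), 0.0123 secs]")

match = pgre.match(sample)
# One entry per parenthesized group, in order of the opening parentheses.
fields = match.groups()
print(fields)
```

Each element of fields corresponds to one parenthesized group: the date, the time, the number before the colon, the three sizes in K, and the duration in seconds.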
I have a string. The end is different, such as index.php?test=1&list=UL or index.php?list=UL&more=1. The one thing I'm looking for is &list=.
How can I match it, whether it's in the middle of the string or it's at the end? So far I've got [&|\?]list=.*?([&|$]), but the ([&|$]) part doesn't actually work; I'm trying to use that to match either & or the end of the string, but the end of the string part doesn't work, so this pattern matches the second example but not the first.
Use:
/(&|\?)list=.*?(&|$)/
Note that when you use a bracket expression, every character within it (with some exceptions) is going to be interpreted literally. In other words, [&|$] matches the characters &, |, and $.
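A short Python sketch of the difference, using the two example strings from the question (the variable names are only illustrative):

```python
import re

urls = ["index.php?test=1&list=UL", "index.php?list=UL&more=1"]

# Inside [...] the | and $ are literal characters, not alternation or an anchor.
bracket = re.compile(r"[&|\?]list=.*?([&|$])")
# Inside (...) the | is alternation and $ is the end-of-string anchor.
group = re.compile(r"(&|\?)list=.*?(&|$)")

bracket_hits = [bool(bracket.search(u)) for u in urls]
group_hits = [bool(group.search(u)) for u in urls]

# The bracket expression happily matches a literal dollar sign.
literal_dollar = re.search(r"[&|$]", "cost: $5").group()

print(bracket_hits, group_hits, literal_dollar)
```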
In short
Any zero-width assertions inside [...] lose their meaning of a zero-width assertion. [\b] does not match a word boundary (it matches a backspace, or, in POSIX, \ or b), [$] matches a literal $ char, [^] is either an error or, as in ECMAScript regex flavor, any char. Same with \z, \Z, \A anchors.
You may solve the problem using any of the below patterns:
[&?]list=([^&]*)
[&?]list=(.*?)(?=&|$)
[&?]list=(.*?)(?![^&])
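As a quick check in Python (chosen here only for illustration), all three patterns capture the same value from both example strings in the question:

```python
import re

urls = ["index.php?test=1&list=UL", "index.php?list=UL&more=1"]

patterns = [
    r"[&?]list=([^&]*)",        # negated character class
    r"[&?]list=(.*?)(?=&|$)",   # positive lookahead
    r"[&?]list=(.*?)(?![^&])",  # negative lookahead
]

# Map each pattern to the captured group 1 for every URL.
results = {p: [re.search(p, u).group(1) for u in urls] for p in patterns}

for pattern, values in results.items():
    print(pattern, values)
```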
If you need to check for the "absolute", unambiguous string end anchor, you need to remember that in various regex flavors, it is expressed with different constructs:
[&?]list=(.*?)(?=&|$) - OK for ECMA regex (JavaScript, default C++ `std::regex`)
[&?]list=(.*?)(?=&|\z) - OK for .NET, Go, Onigmo (Ruby), Perl, PCRE (PHP, base R), Boost, ICU (R `stringr`), Java/Android
[&?]list=(.*?)(?=&|\Z) - OK for Python
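In Python the distinction shows up when the input ends with a line break: $ also matches just before a trailing newline, while \Z matches only at the very end. A minimal sketch with a made-up input string:

```python
import re

# Hypothetical input with a trailing newline.
text = "index.php?list=UL\n"

# $ is allowed to match right before the final "\n".
dollar = re.search(r"[&?]list=(.*?)(?=&|$)", text)
# \Z only matches at the absolute end, after the "\n", which . cannot cross.
strict = re.search(r"[&?]list=(.*?)(?=&|\Z)", text)
# Without the trailing newline both anchors agree.
strict_clean = re.search(r"[&?]list=(.*?)(?=&|\Z)", text.rstrip("\n"))

print(dollar.group(1), strict, strict_clean.group(1))
```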
Matching between a char sequence and a single char or end of string (current scenario)
The .*?([YOUR_SINGLE_CHAR_DELIMITER(S)]|$) pattern (suggested by João Silva) is rather inefficient since the regex engine checks for the patterns that appear to the right of the lazy dot pattern first, and only if they do not match does it "expand" the lazy dot pattern.
In these cases it is recommended to use negated character class (or bracket expression in the POSIX talk):
[&?]list=([^&]*)
Details:
[&?] - a positive character class matching either & or ? (note the relationships between chars/char ranges in a character class are OR relationships)
list= - a substring, char sequence
([^&]*) - Capturing group #1: zero or more (*) chars other than & ([^&]), as many as possible
Checking for the trailing single char delimiter presence without returning it or end of string
Most regex flavors (including JavaScript beginning with ECMAScript 2018) support lookarounds, constructs that only return true or false depending on whether their patterns match. They are crucial when consecutive matches that may start and end with the same char are expected (see the original pattern, which may match a string starting and ending with &). Although that is not expected in a query string, it is a common scenario.
In that case, you can use two approaches:
A positive lookahead with an alternation containing positive character class: (?=[SINGLE_CHAR_DELIMITER(S)]|$)
A negative lookahead with just a negative character class: (?![^SINGLE_CHAR_DELIMITER(S)])
The negative lookahead solution is a bit more efficient because it does not contain an alternation group that adds complexity to the matching procedure. The OP solution would look like
[&?]list=(.*?)(?=&|$)
or
[&?]list=(.*?)(?![^&])
See this regex demo and another one here.
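The consecutive-match problem is easy to reproduce. In this Python sketch the query string is made up for illustration; the consuming variant swallows the & that the next match needs, while both lookahead variants leave it in place:

```python
import re

# Hypothetical query string with two adjacent list= parameters.
query = "?list=a&list=b"

# The trailing & is consumed, so the second parameter has no leading & left.
consuming = re.findall(r"[&?]list=(.*?)(?:&|$)", query)
# Lookaheads only check for the delimiter without consuming it.
positive = re.findall(r"[&?]list=(.*?)(?=&|$)", query)
negative = re.findall(r"[&?]list=(.*?)(?![^&])", query)

print(consuming, positive, negative)
```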
Certainly, in case the trailing delimiters are multichar sequences, only a positive lookahead solution will work since [^yes] does not negate a sequence of chars, but the chars inside the class (i.e. [^yes] matches any char but y, e and s).
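A small Python sketch of both points. The key=...END input and the END delimiter are hypothetical, chosen only to have a multi-character delimiter:

```python
import re

# [^yes] excludes the single characters y, e and s, not the word "yes".
chars = re.findall(r"[^yes]", "yes no")

# Hypothetical input with a multi-character trailing delimiter "END".
value = re.search(r"key=(.*?)(?=END|$)", "key=valueENDrest").group(1)
# The same pattern also works when the value runs to the end of the string.
value_at_end = re.search(r"key=(.*?)(?=END|$)", "key=value").group(1)

print(chars, value, value_at_end)
```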
Is it possible to match parenthesis like () but not allowing nesting? In other words, I want my regex to match () but not (())
The regex that I am trying is
\(\[^\(\)])
but it does not seem to be working. Can someone explain to me what I'm doing wrong?
If (foo) in x(foo)x shall be matched, but (foo) in ((foo)) not, what you want is not possible with regular expressions, as regular expressions represent regular grammars and all regular grammars are context free. But context (or 'state', as Jonathon Reinhart called it in his comment) is necessary for the distinction between the (foo) substrings in x(foo)x and ((foo)).
If you only want to match strings that only consist of a parenthesized substring, without any parentheses (matched or unmatched) in that substring, the following regex will do:
^\([^()]*\)$
^ and $ 'glue' the pattern to the beginning and end of the string, respectively, thereby excluding partial matches
note the arbitrary number of repetitions (…*) of the non-parenthesis character inside the parentheses.
note how special characters are not escaped inside a character set, but still have their literal meaning. (Putting backslashes in there would put literal backslashes in the character set. Or in this case out of the character set, due to the negation.)
note how the [ starting the character set isn't escaped, because we actually want its special meaning, rather than its literal meaning
The last two points might be specific to the dialect of regular expressions Python uses.
So this will match () and (foo) completely, but not (not even partially) (foo)bar), (foo(bar), x(foo), (foo)x or ()().
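The list above can be checked directly. A minimal sketch (the names flat_parens, accepted and rejected are only illustrative):

```python
import re

# Whole string must be "(", any number of non-parenthesis chars, then ")".
flat_parens = re.compile(r"^\([^()]*\)$")

accepted = ["()", "(foo)"]
rejected = ["(foo)bar)", "(foo(bar)", "x(foo)", "(foo)x", "()()"]

# True where the pattern matches, False otherwise.
results = {s: bool(flat_parens.search(s)) for s in accepted + rejected}

for text, ok in results.items():
    print(repr(text), ok)
```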
I am trying to see which statements the following pattern matches:
\(*[0-9]{3}\)*-*[0-9]{3}\d\d\d+
I am a little confused because the grouping characters () each have a \ before them. Does this mean that the statement must have a ( and )? Would that mean the statements without ( or ) would be unmatched?
Statements:
'4046782347'
'(123)1247890'
'456900900'
'(678)2001236'
'4041231234'
'(404123123'
Context is important:
re.match(r'\(', content) matches a literal parenthesis.
re.match(r'\(*', content) matches 0 or more literal parentheses, thus making the parens optional (and allowing more than one of them, but that's clearly a bug).
Since the intended behavior isn't "0 or more" but rather "0 or 1", this should probably be written r'\(?' instead.
That said, there's a whole lot about this regex that's silly. I'd consider instead:
[(]?\d{3}[)]?-?\d{6,}
Using [(]? avoids backslashes, and consequently is easier to read whether it's rendered by str() or repr() (which escapes backslashes).
Mixing [0-9] and \d is silly; better to pick one and stick with it.
Using * in place of ? is silly, unless you really want to match (((123))-----4567890.
\d{3}\d\d\d+ matches three digits, then three or more additional digits. Why not just match six or more digits in the first place?
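A quick way to compare the two patterns is to run them over the statements from the question. This is only a sketch; the sloppy string is an extra test input of my own, and re.match is used as in the examples above, so only the start of each string is anchored:

```python
import re

original = re.compile(r"\(*[0-9]{3}\)*-*[0-9]{3}\d\d\d+")
tightened = re.compile(r"[(]?\d{3}[)]?-?\d{6,}")

statements = [
    "4046782347", "(123)1247890", "456900900",
    "(678)2001236", "4041231234", "(404123123",
]

original_hits = [s for s in statements if original.match(s)]
tightened_hits = [s for s in statements if tightened.match(s)]

# Extra input (not from the question) with repeated parens and dashes.
sloppy = "(((123))-----4567890"
sloppy_original = original.match(sloppy) is not None
sloppy_tightened = tightened.match(sloppy) is not None

print(original_hits == statements, tightened_hits == statements)
print(sloppy_original, sloppy_tightened)
```

Both patterns accept all six statements, including '(404123123', because the closing parenthesis is optional independently of the opening one. Only the original pattern accepts the sloppy input.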
Normally, the parentheses would act as grouping characters, however regex metacharacters are reduced simply to the raw characters when preceded by a backslash. From the Python docs:
As in Python string literals, the backslash can be followed by various characters to signal various special sequences. It’s also used to escape all the metacharacters so you can still match them in patterns; for example, if you need to match a [ or \, you can precede them with a backslash to remove their special meaning: \[ or \\.
In your case, the statements don't need parentheses in order to match, as each \( and \) in the expression is followed by a *, which means that the previous character can be matched any number of times, including none at all. From the Python docs:
* doesn’t match the literal character *; instead, it specifies that the previous character can be matched zero or more times, instead of exactly once.
Thus the statements with or without parentheses around the first 3 digits may match.
Source: https://docs.python.org/2/howto/regex.html
I have a question about regular expression substitution in Python. I have some lines of code, and I want to replace all floating-point literals such as 2.0f and -1.0f with doubles: 2.0, -1.0. I came up with the regular expression '[-+]?[0-9]*\.?[0-9]+f', and it finds what I need, but I am not sure how to do the replacement.
so here's what I have:
# check if floating point value exists
if re.findall('[-+]?[0-9]*\.?[0-9]+f', line):
    line = re.sub('[-+]?[0-9]*\.?[0-9]+f', ????? ,line)
I am not sure what to put under ????? such that it will replace what I found in '[-+]?[0-9]*\.?[0-9]+f' without the char f in the end of the string.
Also there might be more than one floating point values, which is why I used re.findall
Any help would be great. Thanks
Capture the part of the text you want to save in a capturing group and use the \1 substitution operator:
line = re.sub(r'([-+]?[0-9]*\.?[0-9]+)f', r'\1', line)
Note that findall (or any kind of searching) is unnecessary since re.sub will look for the pattern itself and return the string unchanged if there are no matches.
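A minimal sketch of that behaviour, with made-up input lines and a hypothetical helper name to_double:

```python
import re

# Group 1 is the number; the trailing f stays outside the group and is dropped.
FLOAT_F = re.compile(r'([-+]?[0-9]*\.?[0-9]+)f')

def to_double(line):
    """Strip the f suffix from every float literal found in line."""
    return FLOAT_F.sub(r'\1', line)

# Hypothetical source lines, invented for illustration.
converted = to_double("x = 2.0f + -1.0f;")
untouched = to_double("int y = 3;")

print(converted)
print(untouched)
```

Both literals on the first line are rewritten in a single call, and the second line comes back unchanged because nothing matches.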
Now, for several regular expression writing tips:
Always use raw strings (r'...') for regular expressions and substitution strings, otherwise you will need to double your backslashes to escape them from Python's string parser. It is only by accident that you didn't need to do this for \., since . is not part of an escape sequence in Python strings.
Use \d instead of [0-9] to match a digit. They are equivalent, but \d is easier to recognize for "digit", while [0-9] needs to be visually verified.
Your regular expression will not recognize 10.f, which is likely a valid decimal number in your input. Matching floating-point numbers in various formats is trickier than it seems at first, but simple googling will reveal many reasonably complete solutions for this.
The re.X flag will allow you to add arbitrary whitespace and even comments to your regexp. With small regexps that can seem downright silly, but for large expressions the added clarity is a life-saver. (Your regular expression is close to the threshold.)
Here is an example of an extended regular expression that implements the above style tips:
line = re.sub(r'''
    ( [-+]?
      (?: \d+ (?: \.\d* )?   # 12 or 12. or 12.34
        |
          \.\d+              # .12
      )
    ) f''',
    r'\1', line, flags=re.X)
((?:...) is a non-capturing group, only used for precedence.)
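To check that the extended pattern covers the awkward forms, here is a sketch that compiles it once and applies it to a made-up line (the names FLOAT_SUFFIX and strip_f are hypothetical):

```python
import re

# Same extended pattern as above, compiled once with re.X (verbose mode).
FLOAT_SUFFIX = re.compile(r'''
    ( [-+]?
      (?: \d+ (?: \.\d* )?   # 12 or 12. or 12.34
        | \.\d+              # .12
      )
    ) f''', re.X)

def strip_f(line):
    """Replace each matched literal with its group 1, i.e. without the f."""
    return FLOAT_SUFFIX.sub(r'\1', line)

# Hypothetical input covering the 10.f, .5f and 2.0f forms.
result = strip_f("a = 10.f * .5f - 2.0f")
print(result)
```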
This is my go-to reference for all things regex.
http://www.regular-expressions.info/named.html
The result should be something like:
line = re.sub(r'(?P<first>[-+]?[0-9]*\.?[0-9]+)f', r'\g<first>', line)
Surround the part of the regex you want to "keep" in a "capture group", e.g.
'([-+]?[0-9]*\.?[0-9]+)f'
^ ^
And then you can refer to these capture groups using \1 in your substitution:
r'\1'
For future reference, you can have many capture groups, i.e. \2, \3, etc. by order of the opening parentheses.
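A short sketch of numbered and named references in a replacement string; the input strings are invented for illustration:

```python
import re

# \1 and \2 refer to the first and second group, counted by opening parenthesis.
swapped = re.sub(r'(\w+)=(\w+)', r'\2=\1', 'key=value')

# A named group is written (?P<name>...) and referenced as \g<name>.
named = re.sub(r'(?P<first>[-+]?[0-9]*\.?[0-9]+)f', r'\g<first>', 'scale(2.5f)')

print(swapped)
print(named)
```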
Consider this string:
prison break: proof of innocence (2006) {abduction (#1.10)}
I just want to know whether there is a (# floating point value )} in the string or not.
I tried a few regular expressions like
re.search('\(\#+\f+\)\}',xyz)
and
re.search('\(\#+(\d\.\d)+\)\}',xyz)
Nothing worked, though. Can someone suggest something here?
Try r'\(#\d+\.\d+\)\}'
The (, ), ., and } are all special metacharacters, that's why they're preceded by \, so they're matched literally instead.
You also need to apply the + repetition at the right element. Here it's attached to the \d -- the shorthand for digit character class -- to mean that only the digits can appear one-or-more times.
The use of r'raw string literals' makes it easier to work with regex patterns because you don't have to escape backslashes excessively.
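A minimal sketch that applies the suggested pattern to the string from the question and shows what the raw-string prefix saves you:

```python
import re

title = "prison break: proof of innocence (2006) {abduction (#1.10)}"

# Raw string: backslashes reach the regex engine untouched.
found = re.search(r'\(#\d+\.\d+\)\}', title)

# Without the r prefix every backslash has to be doubled to get the same text.
same_pattern = (r'\d+\.\d+' == '\\d+\\.\\d+')

print(found.group())
print(same_pattern)
```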
See also
What exactly do u and r string flags in Python do, and what are raw string literals?
Variations
For instructional purposes, let's consider a few variations. This will show a few basic features of regex. Let's first consider one of the attempted patterns:
\(\#+(\d\.\d)+\)\}
Let's space out the parts for readability:
\( \#+ ( \d \. \d )+ \) \}
       \__________/
       this is one group, repeated with +
So this pattern matches:
A literal (, followed by one-or-more #
Followed by one-or-more of:
A digit, a literal dot, and a digit
Followed by a literal )}
Thus, the pattern will match e.g. (###1.23.45.6)} (as seen on rubular.com). Obviously this is not the pattern we want.
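The same behaviour can be reproduced in Python. This sketch runs the attempted pattern against the odd input above and against the kind of input that was actually wanted:

```python
import re

attempt = re.compile(r'\(\#+(\d\.\d)+\)\}')

# Matches, because the whole digit-dot-digit group is what gets repeated.
odd = attempt.search("(###1.23.45.6)}")
# Does not match: after "1.1" the leftover "0" fits neither the group nor \).
wanted = attempt.search("(#1.10)}")

print(odd.group(), odd.group(1))
print(wanted)
```

Note that group 1 only keeps the last repetition of a repeated group.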
Now let's try to modify the solution pattern and say that perhaps we also want to allow just a sequence of digits, without the subsequent period and following digits. We can do this by grouping that part (…), and making it optional with ?.
BEFORE
\(#\d+\.\d+\)\}
      \___/
      let's make this optional! (…)?
AFTER
\(#\d+(\.\d+)?\)\}
Now the pattern matches e.g. (#1.23)} as well as e.g. (#666)} (as seen on rubular.com).
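In Python the optional group behaves the same way. A small sketch, in which the variable names are only illustrative:

```python
import re

optional_fraction = re.compile(r'\(#\d+(\.\d+)?\)\}')

with_fraction = optional_fraction.search("(#1.23)}")
without_fraction = optional_fraction.search("(#666)}")

# Group 1 holds the fractional part when present, otherwise None.
print(with_fraction.group(), with_fraction.group(1))
print(without_fraction.group(), without_fraction.group(1))
```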
References
regular-expressions.info - Optional, Brackets for Grouping
"Escape everything" and use raw-literal syntax for safety:
>>> s='prison break: proof of innocence (2006) {abduction (#1.10)}'
>>> re.search(r'\(\#\d+\.\d+\)\}', s)
<_sre.SRE_Match object at 0xec950>
>>> _.group()
'(#1.10)}'
>>>
This assumes that by "floating point value" you mean "one or more digits, a dot, one or more digits", and is not tolerant of other floating point syntax variations, multiple hashes (which you appear from your RE patterns to want to support but don't mention in your Q's text), arbitrary whitespace among the relevant parts (again, unclear from your Q whether you need it), ... -- some issues can be adjusted pretty easily, others "not so much" (it's particularly hard to guess what gamut of FP syntax variations you want to support, for example).