Non-greedy regex not matching as expected - python

Given the following string as input:
[2015/06/09 14:21:59] mod=syn|cli=192.168.1.99/49244|srv=192.168.1.100/80|subj=cli|os=Windows 7 or 8|dist=0|params=none|raw_sig=4:128+0:0:1460:8192,8:mss,nop,ws,nop,nop,sok:df,id+:0
I'm trying to match the value of subj, i.e. in the above case the expected output would be cli.
I don't understand why my regex is not working:
subj = re.match(r"(.*)subj=(.*?)|(.*)", line).group(2)
From what I can tell, the second group in here should be cli but I'm getting an empty result.

The | has a special meaning in regex (it creates alternation), so escape it:
>>> re.match(r"(.*)subj=(.*?)\|", line).group(2)
'cli'
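To see why the unescaped pattern returns an empty string, here is a small sketch (the variable names unescaped and escaped are only illustrative). With a bare |, the pattern is an alternation between (.*)subj=(.*?) and (.*), and in the first alternative the lazy (.*?) is satisfied by matching nothing at all.

```python
import re

line = ("[2015/06/09 14:21:59] mod=syn|cli=192.168.1.99/49244|"
        "srv=192.168.1.100/80|subj=cli|os=Windows 7 or 8|dist=0|params=none|"
        "raw_sig=4:128+0:0:1460:8192,8:mss,nop,ws,nop,nop,sok:df,id+:0")

# Unescaped "|" splits the pattern into two alternatives.
# The first one, (.*)subj=(.*?), succeeds with the lazy group matching nothing.
unescaped = re.match(r"(.*)subj=(.*?)|(.*)", line)
print(repr(unescaped.group(2)))  # ''

# Escaped "\|" is a literal pipe, so the lazy group has to extend up to it.
escaped = re.match(r"(.*)subj=(.*?)\|", line)
print(repr(escaped.group(2)))  # 'cli'
```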
Another Solution
You can use re.search() instead, which lets you drop the group before subj= and the one after the |.
Example
>>> re.search(r"subj=(.*?)\|", line).group(1)
'cli'
Here we use group(1) since there is only one group that is being captured instead of three as in previous version.
Read about the differences between search and match
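As a rough illustration of that difference (using a shortened sample line, which is my own and not the full log entry): re.match only tries the pattern at the start of the string, while re.search scans forward for the first position where it matches.

```python
import re

line = "mod=syn|subj=cli|os=Windows 7 or 8"  # shortened sample line

# re.match only tries at position 0, so a pattern starting with "subj=" fails.
match_result = re.match(r"subj=(.*?)\|", line)
print(match_result)  # None

# re.search scans forward until the pattern matches somewhere.
search_result = re.search(r"subj=(.*?)\|", line)
print(search_result.group(1))  # cli
```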
Complex version
You can even get rid of all the capturing groups by using lookarounds:
>>> re.search(r"(?<=subj=).*?(?=\|)", line).group(0)
'cli'
(?<=subj=) Checks that the string matched by .*? is preceded by subj=.
.*? Matches anything, non-greedily.
(?=\|) Checks that this match is followed by a |.

Regex101
I'd recommend the following regex, because it should perform better thanks to two changes:
adding the beginning-of-line anchor ^ (with re.match the pattern is already anchored at the start, so this mostly documents the intent)
using the negated character class [^\|]*, which is faster than the lazy (.*?)
Code
subj = re.match(r"^.*\|subj=([^\|]*)", line).group(1)
regex:
^.*\|subj=([^\|]*)
Debuggex Demo

You need to escape the |. Use the following:
subj = re.match(r"(.*)subj=(.*?)\|(.*)", line).group(2)
                                ^

The pipe sign | needs to be escaped, like so:
subj = re.match(r"(.*)subj=(.*?)\|(.*)", s).group(2)

I would use a negated class [^|]* with re.search for better performance:
import re
p = re.compile(r'^(.*)subj=([^|]*)\|(.*)$')
test_str = "[2015/06/09 14:21:59] mod=syn|cli=192.168.1.99/49244|srv=192.168.1.100/80|subj=cli|os=Windows 7 or 8|dist=0|params=none|raw_sig=4:128+0:0:1460:8192,8:mss,nop,ws,nop,nop,sok:df,id+:0"
print(p.search(test_str).group(2))
See IDEONE demo
Note that I am not mixing lazy and greedy quantifiers in the regex (mixing them is usually not advisable).
The pipe symbol must be escaped to be treated as a literal | symbol.
REGEX EXPLANATION:
^ - Start of string
(.*) - The first capturing group, matching any characters from the beginning of the string up to subj=
subj= - A literal string subj=
([^|]*) - The second capturing group matching any characters other than a literal pipe (inside a character class, it does not need escaping)
\| - A literal pipe (must be escaped)
(.*) - The third capturing group (if you need the rest of the string, up to the end)
$ - End of string
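A quick sketch of what each of the three groups ends up holding, on a shortened sample of my own (the names before, subj and after are just illustrative):

```python
import re

p = re.compile(r'^(.*)subj=([^|]*)\|(.*)$')
sample = "mod=syn|cli=192.168.1.99/49244|subj=cli|os=Windows 7 or 8|dist=0"

m = p.search(sample)
before, subj, after = m.groups()
print(before)  # mod=syn|cli=192.168.1.99/49244|
print(subj)    # cli
print(after)   # os=Windows 7 or 8|dist=0
```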

Related

Regex to match following pattern in SQL query

I am trying to extract parts of a MySQL query to get the information I want.
I used this code / regex in Python:
import re
query = "SELECT `asd`.`ssss` as `column1`, `ss`.`wwwwwww` from `table`"
table_and_columns = re.findall('\`.*?`[.]\`.*?`',query)
My expected output:
['`asd`.`ssss`', '`ss`.`wwwwwww`']
My real output:
['`asd`.`ssss`', '`column1`, `ss`.`wwwwwww`']
Can anybody help me and explain where I went wrong?
The regex should only find the ones that have two strings like asd and a dot in the middle.
PS: I know that this is not a valid query.
The dot . can also match a backtick, so after the opening backtick .*? can run across other backticks until it reaches a closing backtick followed by the literal dot in [.]
There is no need for non-greedy quantifiers; you can use a negated character class instead to prevent crossing the backtick boundary.
`[^`]*`\.`[^`]*`
Regex demo
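To make the difference concrete, here is a small side-by-side sketch on the query from the question (the names lazy_dot and negated are mine):

```python
import re

query = "SELECT `asd`.`ssss` as `column1`, `ss`.`wwwwwww` from `table`"

# Lazy dot: .*? may still run across backticks to make the rest match.
lazy_dot = re.findall(r"`.*?`[.]`.*?`", query)
print(lazy_dot)  # ['`asd`.`ssss`', '`column1`, `ss`.`wwwwwww`']

# Negated class: [^`]* can never cross a backtick.
negated = re.findall(r"`[^`]*`\.`[^`]*`", query)
print(negated)   # ['`asd`.`ssss`', '`ss`.`wwwwwww`']
```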
The asterisk * matches 0 or more times. If there has to be at least one character, and newlines and spaces are unwanted, you could add \s to the negated class to exclude whitespace and use + to match 1 or more times.
`[^`\s]+`\.`[^`\s]+`
Regex demo | Python demo
For example
import re
query = "SELECT `asd`.`ssss` as `column1`, `ss`.`wwwwwww` from `table`"
table_and_columns = re.findall(r'`[^`\s]+`\.`[^`\s]+`', query)
print(table_and_columns)
Output
['`asd`.`ssss`', '`ss`.`wwwwwww`']
Please try the regex below. The problem is that .*?, expanding from left to right, is allowed to match backticks as well.
Instead, you should use [^`]*
`[^`]*?`\.`[^`]*?`
Demo
The thing is that
.*? matches any character (except line terminators), including whitespace.
Also, since * already means zero or more occurrences, I'm not sure you need the ? at all.
So this seems to work:
\`\S+\`[.]\`\S+\`
where \S is any non-whitespace character.
You can always check your regexes using https://regex101.com

Regex for parsing uid from URL

I am trying to parse UIDs from URLs. However, regex is not something I am good at, so I'm looking for some help.
Example Input:
https://example.com/d/iazs9fEil/somethingelse?foo=bar
Example Output:
iazs9fEil
What I've tried so far is
([/d/]+[\d\x])\w+
Which somehow works, but returns it with the /d/ prefix, so the output is /d/iazs9fEil.
How to change the regex to not contain the /d/ prefix?
EDIT:
I've tried this regex ([^/d/]+[\d\x])\w+ which outputs the correct string which is iazs9fEil, but also returns the rest of the url, so here it is somethingelse?foo=bar
In short, you may use
match = re.search(r'/d/(\w+)', your_string)  # Look for a match
if match:  # Check if there is a match first
    print(match.group(1))  # Now, get Group 1 value
See this regex demo.
NOTE
/ is not a special metacharacter, so do not escape it in Python string patterns
([/d/]+[\d\x])\w+ matches and captures into Group 1 one or more slashes or d characters (see [/d/]+, a positive character class), then a digit or \x (here, Python raises an error: sre_constants.error: incomplete escape \x; it could perhaps have parsed it as x, but that is not the case), and then matches 1+ word chars. You put /d/ into a character class, so it stopped matching a char sequence: [/d/]+ matches slashes and d characters in any order and amount, and places that string into Group 1.
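A short sketch of the two points in this note, on a made-up test string of my own: a character class is a set of single characters rather than a sequence, and \x without two hex digits is rejected when the pattern is compiled.

```python
import re

# A character class is a set: [/d/] is the same as [/d] and matches "/" or "d".
class_matches = re.findall(r"[/d/]+", "/d/ d// dd")
print(class_matches)  # ['/d/', 'd//', 'dd']

# Without brackets, /d/ is the literal three-character sequence.
sequence_matches = re.findall(r"/d/", "/d/ d// dd")
print(sequence_matches)  # ['/d/']

# \x must be followed by two hex digits, so [\d\x] does not compile.
try:
    re.compile(r"[\d\x]")
    compiled = True
except re.error:
    compiled = False
print(compiled)  # False
```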
Try (?<=/d/)[^/]+
Explanation:
(?<=/d/) - positive lookbehind, asserts that what precedes is /d/
[^/]+ - match one or more characters other than /, so it matches everything until /
Demo
You could use a capturing group:
https?://.*?/d/([^/\s]+)
Regex demo
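For comparison, a minimal sketch that runs the three suggested patterns against the example URL. The dictionary layout and labels are just for illustration; each entry pairs a pattern with the group number that holds the UID.

```python
import re

url = "https://example.com/d/iazs9fEil/somethingelse?foo=bar"

# (pattern, group holding the UID); group 0 is the whole match.
patterns = {
    "capture after /d/": (r"/d/(\w+)", 1),
    "lookbehind": (r"(?<=/d/)[^/]+", 0),
    "full URL with capture": (r"https?://.*?/d/([^/\s]+)", 1),
}

results = {}
for name, (pattern, group) in patterns.items():
    match = re.search(pattern, url)
    results[name] = match.group(group) if match else None
    print(name, "->", results[name])  # each one prints iazs9fEil
```

Note that the patterns differ in what they allow inside the UID: \w+ stops at the first non-word character, while [^/]+ accepts anything up to the next slash.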

How to search/extract patterns in a string?

I have a pattern I want to search for in my message.
The patterns are:
1. "aaa-b3-c"
2. "a3-b6-c"
3. "aaaa-bb-c"
I know how to search for one of the patterns, but how do I search for all 3?
Also, how do you identify and extract dates in this format: 5/21 or 5/21/2019.
found = re.findall(r'.{3}-.{2}-.{1}', message)
Try this :
found = re.findall(r'a{2,4}-b{2}-c', message)
You could use
a{2,4}-bb-c
as a pattern.
Now you need to check the match for truthiness:
match = re.search(pattern, string)
if match:
    pass  # do something here
As of Python 3.8 you can use the walrus operator, as in
if (match := re.search(pattern, string)) is not None:
    pass  # do something here
try this:
re.findall(r'a.*-b.*-c',message)
The first part could use a quantifier {2,4} instead of {3}. The dot matches any character except a newline, whereas [a-zA-Z0-9] matches an upper- or lowercase char a-z or a digit:
\b[a-zA-Z0-9]{2,4}-[a-zA-Z0-9]{2}-[a-zA-Z0-9]\b
Demo
You could add word boundaries \b or anchors ^ and $ on either side if the characters should not be part of a longer word.
For the dates, you could use \d with a quantifier to match the digits and an optional group to match the part with / and 4 digits:
\d{1,2}/\d{2}(?:/\d{4})?
Regex demo
Note that the pattern does not validate the date itself. Perhaps this page can help you create or customize a more specific date format.
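Putting both patterns together, a sketch on a made-up message. The is_real_date helper is hypothetical and only shows one possible way to check that a matched M/D/YYYY string is an actual calendar date, since the regex alone only checks the shape.

```python
import re
from datetime import datetime

message = "Codes aaa-b3-c, a3-b6-c and aaaa-bb-c are due 5/21 or 5/21/2019."

codes = re.findall(r"\b[a-zA-Z0-9]{2,4}-[a-zA-Z0-9]{2}-[a-zA-Z0-9]\b", message)
dates = re.findall(r"\d{1,2}/\d{2}(?:/\d{4})?", message)
print(codes)  # ['aaa-b3-c', 'a3-b6-c', 'aaaa-bb-c']
print(dates)  # ['5/21', '5/21/2019']


def is_real_date(text):
    """Hypothetical helper: True if text in M/D/YYYY form is a real date."""
    try:
        datetime.strptime(text, "%m/%d/%Y")
        return True
    except ValueError:
        return False


# The regex accepts the shape of "13/45/2019", but it is not a real date.
shape_ok = re.fullmatch(r"\d{1,2}/\d{2}(?:/\d{4})?", "13/45/2019") is not None
print(shape_ok, is_real_date("13/45/2019"))  # True False
print(is_real_date("5/21/2019"))             # True
```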
Here, we might just want to write three expressions, one per pattern, matching the input from left to right, and connect them using logical ORs; if we had more patterns, we could simply add them, similar to:
([a-z]+-[a-z]+[0-9]+-[a-z]+)
([a-z]+[0-9]+-[a-z]+[0-9]+-[a-z])
([a-z]+-[a-z]+-[a-z])
which would combine to:
([a-z]+-[a-z]+[0-9]+-[a-z]+)|([a-z]+[0-9]+-[a-z]+[0-9]+-[a-z])|([a-z]+-[a-z]+-[a-z])
Then, we might want to bound it with start and end chars:
^([a-z]+-[a-z]+[0-9]+-[a-z]+)$|^([a-z]+[0-9]+-[a-z]+[0-9]+-[a-z])$|^([a-z]+-[a-z]+-[a-z])$
or
^(([a-z]+-[a-z]+[0-9]+-[a-z]+)|([a-z]+[0-9]+-[a-z]+[0-9]+-[a-z])|([a-z]+-[a-z]+-[a-z]))$
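As a quick check of the combined, anchored expression, here is a sketch that tries it on the three sample patterns plus one string of my own that should be rejected (the names combined, samples and matched are illustrative):

```python
import re

# The last expression above, split across lines only for readability.
combined = re.compile(
    r"^(([a-z]+-[a-z]+[0-9]+-[a-z]+)"
    r"|([a-z]+[0-9]+-[a-z]+[0-9]+-[a-z])"
    r"|([a-z]+-[a-z]+-[a-z]))$"
)

samples = ["aaa-b3-c", "a3-b6-c", "aaaa-bb-c", "aaa-b3-c1"]
matched = {s: bool(combined.match(s)) for s in samples}
for sample, ok in matched.items():
    print(sample, "->", ok)  # only "aaa-b3-c1" gives False
```

Because of the ^ and $ anchors this only accepts strings that consist of a single pattern; to find patterns inside a longer message, the unanchored version with re.findall or re.finditer would be the one to use.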
RegEx
If this expression wasn't desired, it can be modified or changed in regex101.com.

Regex, not statement

Heyho,
I have the regex
([ ;(\{\}),\[\'\"]?)(_[a-zA-Z_\-0-9]*)([ =;\/*\-+\]\"\'\}\{,]?)
to match every occurrence of
_var
Problem is that it also matches strings like
test_var
I tried to add a new matching group negating any word character but it didn't work properly.
Can someone figure out what I have to do to not match strings like var_var?
Thanks for help!
You can use the following "fix":
([[ ;(){},'"]?)(\b_[a-zA-Z_0-9-]*\b)([] =;/*+"'{},-]?)
                ^                ^
See regex demo
The word boundary \b is an anchor that asserts a position between a word character and a non-word character. That means your _var will never match if it is preceded by a letter, a digit, or an underscore. Also, I removed the over-escaping inside the character classes in the optional capturing groups. Note the so-called "smart placement" of hyphens and square brackets, which might not be that important for a Python regex but is still a best practice when writing regexes. Also, in a Python regex you don't need to escape / since there are no regex delimiters there.
And one more hint: \w matches [a-zA-Z0-9_] when Unicode matching is off (the default in Python 2; in Python 3, pass re.ASCII), so you can write the regex as
([[ ;(){},'"]?)(\b_[\w-]*\b)([] =;/*+"'{},-]?)
See regex demo 2.
And an IDEONE demo (note the use of r'...'):
import re
p = re.compile(r'([[ ;(){},\'"]?)(\b_[\w-]*\b)([] =;/*+"\'{},-]?)')
test_str = "Some text _var and test_var"
print (re.findall(p, test_str))
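To isolate what the word boundary contributes, here is a simplified sketch that drops the optional surrounding groups and keeps only the middle part of the pattern (this simplification is mine, not the full regex from the answer):

```python
import re

test_str = "Some text _var and test_var"

# Without a boundary, the "_var" inside "test_var" is found as well.
without_boundary = re.findall(r"_[\w-]*", test_str)
print(without_boundary)  # ['_var', '_var']

# \b before "_" requires a non-word character (or the string start) before it.
with_boundary = re.findall(r"\b_[\w-]*\b", test_str)
print(with_boundary)     # ['_var']
```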

Is there a way to refer to the entire matched expression in re.sub without the use of a group?

Suppose I want to prepend all occurrences of a particular expression with a character such as \.
In sed, it would look like this.
echo '__^^^%%%__FooBar' | sed 's/[_^%]/\\&/g'
Note that the & character is used to represent the original matched expression.
I have looked through the regex docs and the regex howto, but I do not see an equivalent to the & character that can be used to substitute in the matched expression.
The only workaround I have found is to use an extra set of () to group the expression and then reference the group, as follows.
import re
line = "__^^^%%%__FooBar"
print(re.sub("([_%^$])", r"\\\1", line))
Is there a clean way to reference the entire matched expression without the extra group creation?
From the docs:
The backreference \g<0> substitutes in the entire substring matched by the RE.
Example:
>>> print(re.sub("[_%^$]", r"\\\g<0>", line))
\_\_\^\^\^\%\%\%\_\_FooBar
You could also get the result by using a positive lookahead.
>>> print(re.sub("(?=[_%^$])", r"\\", line))
\_\_\^\^\^\%\%\%\_\_FooBar
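For reference, a sketch that puts these variants side by side in Python 3. The function-based replacement is an extra option not mentioned in the answers above: re.sub also accepts a callable that receives the match object, so m.group(0) gives the whole match without any extra group.

```python
import re

line = "__^^^%%%__FooBar"

# \g<0> in the replacement template stands for the entire match.
with_backref = re.sub(r"[_%^$]", r"\\\g<0>", line)

# A callable replacement gets the match object; group(0) is the whole match.
with_function = re.sub(r"[_%^$]", lambda m: "\\" + m.group(0), line)

# A lookahead matches the empty string before each character to escape.
with_lookahead = re.sub(r"(?=[_%^$])", r"\\", line)

print(with_backref)  # \_\_\^\^\^\%\%\%\_\_FooBar
print(with_backref == with_function == with_lookahead)  # True
```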
