I am trying to extract parts of a MySQL query to get the information I want.
I used this code / regex in Python:
import re
query = "SELECT `asd`.`ssss` as `column1`, `ss`.`wwwwwww` from `table`"
table_and_columns = re.findall('\`.*?`[.]\`.*?`',query)
My expected output:
['`asd`.`ssss`', '`ss`.`wwwwwww`']
My real output:
['`asd`.`ssss`', '`column1`, `ss`.`wwwwwww`']
Can anybody help me and explain where I went wrong?
The regex should only find the ones that have two strings like asd and a dot in the middle.
PS: I know that this is not a valid query.
The dot . can also match a backtick, so the pattern starts by matching a backtick and is able to match all chars until it reaches the literal dot in [.]
There is no need to use non-greedy quantifiers; you can use a negated character class to prevent crossing the backtick boundary.
`[^`]*`\.`[^`]*`
Regex demo
The asterisk * matches 0 or more times. If there has to be at least one character, and newlines and spaces are unwanted, you could add \s to the negated class to prevent matching whitespace chars and use + to match 1 or more times.
`[^`\s]+`\.`[^`\s]+`
Regex demo | Python demo
For example
import re
query = "SELECT `asd`.`ssss` as `column1`, `ss`.`wwwwwww` from `table`"
table_and_columns = re.findall(r'`[^`\s]+`\.`[^`\s]+`', query)  # raw string so \s and \. reach the regex engine unchanged
print(table_and_columns)
Output
['`asd`.`ssss`', '`ss`.`wwwwwww`']
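To see the difference side by side, here is a minimal sketch that runs the question's lazy pattern and the negated character class from this answer against the same query. The variable names are only illustrative.

```python
import re

query = "SELECT `asd`.`ssss` as `column1`, `ss`.`wwwwwww` from `table`"

# Lazy dot: .*? may still consume backticks, so a match that starts at
# `column1` keeps growing until it reaches a "`.`" sequence.
lazy_matches = re.findall(r'`.*?`[.]`.*?`', query)

# Negated class: [^`]* can never step over a backtick.
negated_matches = re.findall(r'`[^`]*`\.`[^`]*`', query)

print(lazy_matches)     # ['`asd`.`ssss`', '`column1`, `ss`.`wwwwwww`']
print(negated_matches)  # ['`asd`.`ssss`', '`ss`.`wwwwwww`']
```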
Please try the regex below. The .*? in your pattern can match backticks too, so, scanning from left to right, it runs past the closing backtick; that is what caused the issue.
Instead, you should search for [^`]*
`[^`]*?`\.`[^`]*?`
Demo
The thing is that
.*? matches any character (except line terminators), including whitespace.
Also, since you're already using *, which means zero or more occurrences, I'm not sure you need the ? (after a quantifier it only makes the match lazy).
So this seems to work:
\`\S+\`[.]\`\S+\`
where \S is any non-whitespace character.
You can always check your regexes using https://regex101.com
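One caveat with \S+ that may be worth checking against your own data: \S also matches backticks, so two qualified names that are not separated by whitespace can be merged into one match. A small sketch, using a made-up input string that is not from the question:

```python
import re

# Hypothetical input with no whitespace between the two qualified names.
compact = "`a`.`b`,`c`.`d`"

# \S+ is free to run across backticks and the comma.
non_space = re.findall(r'`\S+`[.]`\S+`', compact)

# The negated class stops at every backtick.
negated = re.findall(r'`[^`\s]+`\.`[^`\s]+`', compact)

print(non_space)  # ['`a`.`b`,`c`.`d`']
print(negated)    # ['`a`.`b`', '`c`.`d`']
```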
Related
I currently have this regular expression that I use to match the result of an SQL query: [^\\n]+(?=\\r\\n\\r\\n\(1 rows affected\)). However, it is not working as intended....
'\r\n----------------------------------------------------------
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
--------------------------------------\r\nCS: GPS
on Date.
\r\n\r\n(1 rows affected)\r\n'
What I get from the expression above is Date, whereas I want to match CS: GPS on Date. Leading and trailing spaces are fine; nothing Python's strip can't handle. How do I change my regular expression so that the match is done properly?
Thanks in advance.
Edit: The Python version I am using is Python 3.6
You get your current match because the character class [^\\n]+ matches 1+ times any char except \ or n.
Then the positive lookahead asserts what is on the right is \r\n\r\n(1 rows affected) which results in matching Date.
See https://regex101.com/r/wDzq8l/1
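The effect of that character class can be reproduced in isolation. This is a small sketch that assumes the pattern reaches the regex engine exactly as written in the question, with the doubled backslash:

```python
import re

# [^\\n] excludes the backslash and the letter "n", not the newline.
pieces = re.findall(r'[^\\n]+', 'on Date.')

# For comparison, [^\n] excludes only the newline character.
whole = re.findall(r'[^\n]+', 'on Date.')

print(pieces)  # ['o', ' Date.']
print(whole)   # ['on Date.']
```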
You could use a non greedy .+? in a capturing group and match what follows instead of using a positive lookahead.
In the code use re.DOTALL to let the dot match a newline.
-\\r\\n(.+?) ?\\r\\n\\r\\n\(\d+ rows affected\)
Regex demo
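In Python this could look roughly like the sketch below. It assumes the tool output contains real carriage-return and line-feed characters (so the pattern uses single backslashes inside a raw string), and the sample text is a shortened stand-in for the output in the question.

```python
import re

# Assumed shape of the tool output, shortened, with real CR/LF characters.
output = "\r\n" + "-" * 40 + "\r\nCS: GPS\non Date.\r\n\r\n(1 rows affected)\r\n"

# re.DOTALL lets the dot in (.+?) match the newline inside the result text.
pattern = re.compile(r'-\r\n(.+?) ?\r\n\r\n\(\d+ rows affected\)', re.DOTALL)

match = pattern.search(output)
result = match.group(1) if match else None
print(repr(result))  # 'CS: GPS\non Date.'
```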
Maybe an expression similar to
-{5,}\s*([A-Za-z][^.]+\.)
would extract that, or something close to it.
Demo
Test
import re
regex = r'-{5,}\s*([A-Za-z][^.]+\.)'
string = '''
----------------------------------------------------------
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
--------------------------------------
CS: GPS
on Date.
\r\n\r\n(1 rows affected)\r\n
'''
print(re.findall(regex, string, re.DOTALL))
Output
['CS: GPS\non Date.']
If you wish to simplify, modify, or explore the expression, it is explained in the top-right panel of regex101.com, where you can also step through how it matches against sample inputs.
I am often faced with patterns where the part which is interesting is delimited by a specific character, the rest does not matter. A typical example:
/dev/sda1 472437724 231650856 216764652 52% /
I would like to extract 52 (which can also be 9, or 100 - so 1 to 3 digits) by saying "match anything, then when you get to % (which is unique in that line), see before for the matches to extract".
I tried to code this as .*(\d*)%.* but the group is not matched:
.* match anything, any number of times
% ... until you get to the literal % (the \d is also matched by .* but my understanding is that once % is matched, the regex engine will work backwards, since it now has an "anchor" on which to analyze what was before -- please tell if this reasoning is incorrect, thank you)
(\d*) ... and now before that % you had a (\d*) to match and group
.* ... and the rest does not matter (match everything)
Your regex does not work because . matches too much, and the group matches too little. The group \d* can match nothing at all because of the * quantifier, leaving everything to be matched by the .*.
And your description of .* is somewhat incorrect. It actually matches everything until the end, and then moves backwards until the thing after it ((\d*)%.*) matches.
In fact, I think your text can be matched simply by:
(\d{1,3})%
And getting group 1.
The logic of "keep looking until you find..." is kind of baked into the regex engine, so you don't need to explicitly say .* unless you want it in the match. In this case you just want the number before the % right?
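As a minimal sketch of that idea in Python (the variable names are mine, not from the question), re.search scans for the first place where the pattern fits, so no leading .* is needed:

```python
import re

line = "/dev/sda1 472437724 231650856 216764652 52% /"

# re.search walks through the line until 1 to 3 digits are followed by %.
match = re.search(r'(\d{1,3})%', line)
usage = match.group(1) if match else None

print(usage)  # 52
```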
If you just want to extract the number, I would use:
import re
pattern = r"\d+(?=%)"  # \d+ rather than \d*, which can also produce an empty match just before the %
string = "/dev/sda1 472437724 231650856 216764652 52% /"
returnedMatches = re.findall(pattern, string)
The regex uses a positive lookahead, (?=%), so the digits match only when a % follows, and the % itself is not part of the match.
In your pattern, this part .* matches until the end of the string. Then it backtracks, giving up as little as possible until it can match 0+ digits and a %.
The % is matched because matching 0+ digits is ok. Then you match again .* till the end of the string. There is a capturing group, only it is empty.
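The backtracking described above can be observed directly. The following sketch compares the original pattern with a variant that requires at least one digit; the second pattern is only there for illustration and is not a recommended fix.

```python
import re

line = "/dev/sda1 472437724 231650856 216764652 52% /"

# \d* is allowed to match nothing, so the greedy .* keeps both digits.
empty_group = re.match(r'.*(\d*)%.*', line).group(1)

# \d+ forces one digit, and .* gives back exactly one character for it.
one_digit = re.match(r'.*(\d+)%.*', line).group(1)

print(repr(empty_group))  # ''
print(repr(one_digit))    # '2'
```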
What you might do is add a word boundary or a space before the digits:
.* (\d{1,3})%.* or .*\b(\d{1,3})%.*
Regex demo 1 Or regex demo 2
Note that using .* (greedy) you will get the last instance of the digits and the % sign.
If you would make it non greedy, you would match the first occurrence:
.*?(\d{1,3})%.*
Regex demo
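The difference only shows up when a line contains more than one percentage. A small sketch with a made-up line (not from the question) that has two of them:

```python
import re

# Hypothetical line containing two percentages.
line = "used 10% free 20% total"

# Greedy .* backtracks from the end, so it lands on the last percentage.
last = re.match(r'.* (\d{1,3})%.*', line).group(1)

# Lazy .*? grows from the start, so it stops at the first percentage.
first = re.match(r'.*?(\d{1,3})%.*', line).group(1)

print(last)   # 20
print(first)  # 10
```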
By default regex matches as greedily as possible. The initial .* in your regex sequence is matching everything up to the %:
"/dev/sda1 472437724 231650856 216764652 52"
This is acceptable for the regex, because it just chooses to have the next pattern, (\d*), match 0 characters.
In this scenario, a couple of options could work for you. I would recommend using the preceding space to define a sequence which "starts with a single space, contains any number of digits in the middle, and ends with a percent sign":
' (\d*)%'
Try this:
.*(\b\d{1,3}(?=\%)).*
demo
I am trying to parse UIDs from URLs. However, regex is not something I am good at, so I am looking for some help.
Example Input:
https://example.com/d/iazs9fEil/somethingelse?foo=bar
Example Output:
iazs9fEil
What I've tried so far is
([/d/]+[\d\x])\w+
Which somehow works, but returns it with the /d/ prefix, so the output is /d/iazs9fEil.
How to change the regex to not contain the /d/ prefix?
EDIT:
I've tried this regex ([^/d/]+[\d\x])\w+ which outputs the correct string which is iazs9fEil, but also returns the rest of the url, so here it is somethingelse?foo=bar
In short, you may use
match = re.search(r'/d/(\w+)', your_string)  # Look for a match
if match:  # Check if there is a match first
    print(match.group(1))  # Now, get Group 1 value
NOTE
/ is not a special metacharacter, so do not escape it in Python string patterns
([/d/]+[\d\x])\w+ matches and captures into Group 1 one or more characters that are either a slash or the letter d (see [/d/]+, a positive character class; inside it, d is a literal letter, not the digit shorthand \d) and then a digit or x (here, Python actually raises an error, sre_constants.error: incomplete escape \x; it does not parse \x as a plain x), and then matches 1+ word chars. Because /d/ was put into a character class, it stopped matching the character sequence /d/: [/d/]+ matches slashes and d letters in any order and amount, and that text ends up inside Group 1.
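A quick way to check what [/d/]+ actually matches is to run it on its own against the example URL. This is only an illustrative sketch:

```python
import re

url = "https://example.com/d/iazs9fEil/somethingelse?foo=bar"

# [/d/]+ is a character class: any run of "/" or "d" characters.
runs = re.findall(r'[/d/]+', url)

# Without the brackets, /d/ is the literal three-character sequence.
literal = re.findall(r'/d/', url)

print(runs)     # ['//', '/d/', '/']
print(literal)  # ['/d/']
```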
Try (?<=/d/)[^/]+
Explanation:
(?<=/d/) - positive lookbehind, asserts that what precedes is /d/
[^/]+ - match one or more characters other than /, so it matches everything until /
Demo
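In Python, the lookbehind version could be used as in the sketch below. Since nothing is captured, the UID is the whole match, i.e. group(0):

```python
import re

url = "https://example.com/d/iazs9fEil/somethingelse?foo=bar"

# The lookbehind checks for /d/ without making it part of the match.
match = re.search(r'(?<=/d/)[^/]+', url)
uid = match.group(0) if match else None

print(uid)  # iazs9fEil
```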
You could use a capturing group:
https?://.*?/d/([^/\s]+)
Regex demo
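A possible way to wrap this pattern in Python is sketched below. The helper name extract_uid and the second URL are hypothetical, added only to show the no-match case.

```python
import re

UID_PATTERN = re.compile(r'https?://.*?/d/([^/\s]+)')

def extract_uid(url):
    """Return the path segment after /d/, or None if the URL has none."""
    match = UID_PATTERN.search(url)
    return match.group(1) if match else None

found = extract_uid("https://example.com/d/iazs9fEil/somethingelse?foo=bar")
missing = extract_uid("https://example.com/other/path")

print(found)    # iazs9fEil
print(missing)  # None
```

Note that [^/\s]+ only stops at a slash or whitespace, so if the UID were the last path segment, a query string directly after it would be captured as well.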
Heyho,
I have the regex
([ ;(\{\}),\[\'\"]?)(_[a-zA-Z_\-0-9]*)([ =;\/*\-+\]\"\'\}\{,]?)
to match every occurrence of
_var
Problem is that it also matches strings like
test_var
I tried to add a new matching group negating any word character, but it didn't work properly.
Can someone figure out what I have to do to not match strings like var_var?
Thanks for help!
You can use the following "fix":
([[ ;(){},'"]?)(\b_[a-zA-Z_0-9-]*\b)([] =;/*+"'{},-]?)
                ^                ^
See regex demo
The word boundary \b is an anchor that asserts a position between a word character and a non-word character (or the start or end of the string). That means your _var will never match if it is preceded by a letter, a digit, or an underscore. Also, I removed the over-escaping inside the character classes in the optional capturing groups. Note the so-called "smart placement" of hyphens and square brackets, which might not matter much for a Python regex but is still a best practice when writing regexes. Also, in a Python regex you don't need to escape / since there are no regex delimiters there.
And one more hint: \w matches [a-zA-Z0-9_] (in Python 3 it also matches other Unicode word characters unless re.ASCII is set), so you can write the regex as
([[ ;(){},'"]?)(\b_[\w-]*\b)([] =;/*+"'{},-]?)
See regex demo 2.
And an IDEONE demo (note the use of r'...'):
import re
p = re.compile(r'([[ ;(){},\'"]?)(\b_[\w-]*\b)([] =;/*+"\'{},-]?)')
test_str = "Some text _var and test_var"
print (re.findall(p, test_str))
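The effect of the word boundary is easier to see with a stripped-down pattern. The sketch below leaves out the optional delimiter groups and uses a made-up test string, so it only illustrates the \b part of the fix:

```python
import re

text = "x = _var + test_var"

# With \b, the underscore must not follow a letter, digit, or underscore.
bounded = re.findall(r'\b_\w+', text)

# Without \b, the _var inside test_var is picked up as well.
unbounded = re.findall(r'_\w+', text)

print(bounded)    # ['_var']
print(unbounded)  # ['_var', '_var']
```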
Given the following string as input:
[2015/06/09 14:21:59] mod=syn|cli=192.168.1.99/49244|srv=192.168.1.100/80|subj=cli|os=Windows 7 or 8|dist=0|params=none|raw_sig=4:128+0:0:1460:8192,8:mss,nop,ws,nop,nop,sok:df,id+:0
I'm trying to match the value of subj, ie: in the above case the expected output would be cli
I don't understand why my regex is not working:
subj = re.match(r"(.*)subj=(.*?)|(.*)", line).group(2)
From what I can tell, the second group in here should be cli but I'm getting an empty result.
The | has a special meaning in regex (it creates an alternation), so escape it:
>>> re.match(r"(.*)subj=(.*?)\|", line).group(2)
'cli'
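To see why the unescaped version returns an empty string rather than failing, here is a small sketch on a shortened, made-up version of the log line. With a bare |, the first alternative ends right after the lazy group, which is happy to match nothing.

```python
import re

# Shortened, hypothetical version of the log line from the question.
line = "mod=syn|subj=cli|os=Windows 7 or 8"

# Bare |: the pattern is "(.*)subj=(.*?)" OR "(.*)".
unescaped = re.match(r"(.*)subj=(.*?)|(.*)", line)

# Escaped \|: the lazy group has to grow until it reaches a literal pipe.
escaped = re.match(r"(.*)subj=(.*?)\|", line)

print(repr(unescaped.group(2)))  # ''
print(repr(escaped.group(2)))    # 'cli'
```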
Another Solution
You can use re.search() so that you can drop the groups before subj= and after the |.
Example
>>> re.search(r"subj=(.*?)\|", line).group(1)
'cli'
Here we use group(1) since there is only one group that is being captured instead of three as in previous version.
Read about the differences between search and match
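In short, re.match only tries the pattern at the start of the string, while re.search scans for the first position where it fits. A minimal sketch on a shortened, made-up version of the log line:

```python
import re

# Shortened, hypothetical version of the log line from the question.
line = "mod=syn|subj=cli|os=Windows 7 or 8"

# The line does not start with "subj=", so match finds nothing.
from_match = re.match(r"subj=(.*?)\|", line)

# search is free to start at any position in the line.
from_search = re.search(r"subj=(.*?)\|", line)

print(from_match)            # None
print(from_search.group(1))  # cli
```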
Complex version
You can even get rid of all the capturing if you are using look arounds
>>> re.search(r"(?<=subj=).*?(?=\|)", line).group(0)
'cli'
(?<=subj=) Checks that the string matched by .*? is preceded by subj=.
.*? Matches anything, non-greedily.
(?=\|) Checks that this anything is followed by a |.
Regex101
I'd recommend using the following regex, because it should perform better thanks to two changes:
adding the beginning-of-line anchor ^
using the negated character class [^\|]* instead of the lazy (.*?), which is faster
Code
subj = re.match(r"^.*\|subj=([^\|]*)", line).group(1)
regex:
^.*\|subj=([^\|]*)
Debuggex Demo
You need to escape the |. Use the following:
subj = re.match(r"(.*)subj=(.*?)\|(.*)", line).group(2)
                                ^
The pipe sign | needs to be escaped, like so:
subj = re.match(r"(.*)subj=(.*?)\|(.*)", s).group(2)
I would use a negated class [^|]* with re.search for better performance:
import re
p = re.compile(r'^(.*)subj=([^|]*)\|(.*)$')
test_str = "[2015/06/09 14:21:59] mod=syn|cli=192.168.1.99/49244|srv=192.168.1.100/80|subj=cli|os=Windows 7 or 8|dist=0|params=none|raw_sig=4:128+0:0:1460:8192,8:mss,nop,ws,nop,nop,sok:df,id+:0"
print(re.search(p, test_str).group(2))
See IDEONE demo
Note that I am not mixing lazy and greedy quantifiers in the regex (mixing them is usually not advisable).
The pipe symbol must be escaped to be treated as a literal | symbol.
REGEX EXPLANATION:
^ - Start of string
(.*) - The first capturing group that matches characters from the beginning up to
subj= - A literal string subj=
([^|]*) - The second capturing group matching any characters other than a literal pipe (inside a character class, it does not need escaping)
\| - A literal pipe (must be escaped)
(.*) - The third capturing group (if you need to get the rest of the string up to the end)
$ - End of string
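The same breakdown can be kept next to the pattern itself by using re.VERBOSE, which ignores whitespace in the pattern and allows # comments. This is just one possible way to write it, shown on a shortened, made-up version of the log line:

```python
import re

pattern = re.compile(r"""
    ^          # start of string
    (.*)       # group 1: everything before subj= (greedy)
    subj=      # the literal text subj=
    ([^|]*)    # group 2: any characters other than a pipe
    \|         # a literal pipe
    (.*)       # group 3: the rest of the string
    $          # end of string
""", re.VERBOSE)

# Shortened, hypothetical version of the log line from the question.
line = "mod=syn|subj=cli|os=Windows 7 or 8"

before, subj, after = pattern.search(line).groups()
print(subj)  # cli
```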