Python Regex (Lookaround)


I am trying to write a Python regex.
First I read each line of the file into a list, then I loop through the list.
Q1. I want to capture lines where an arithmetic operator has no spaces around it. Something like
Capture:
a = a+5
Does not capture:
a = a + 5
For this, I have written something like:
for i in array:
    pattern = re.search(r"\S(\+|\-|\*|\\)\S",i)
\S : anything other than whitespace
(\+|\-|\*|\\) : mathematical operators
\S : anything other than whitespace
But the problem is that it also captures lines where post-increment operators are used.
Captures :
a = a++
How could I write a regex that does not capture lines where post-increment operators are used?
Q2. I want to capture where multi-line comments are used in a file.
I tried the expression below, but it fails to capture them. I don't know what I have done wrong. Kindly help.
for i in array:
    pattern = re.search(r"\/\*[A-Za-z0-9_]\*\/",i)

For question 1, you want a plus that is neither preceded nor followed by a space or another plus. This can be written as
r"(?<![+\s])[+](?![+\s])"
You can do the same expression with minus instead of plus, or star, or slash. Then join these expressions with the | sign.
For question 2, you can try
r"[/][*](?:[^*]|[*](?![/]))*[*][/]"
Of course it won't handle nested multiline comments. For these, a simple regex won't suffice.
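As a rough sketch of how the two suggestions above could be put together: the helper name build_operator_pattern and the sample strings are illustrative assumptions, not part of the original answer.

```python
import re

def build_operator_pattern(operators="+-*/"):
    """Join one lookaround expression per operator with '|'."""
    parts = []
    for op in operators:
        cls = re.escape(op)  # escape regex metacharacters before interpolating
        # The operator must not be preceded or followed by whitespace or by itself.
        parts.append(rf"(?<![{cls}\s])[{cls}](?![{cls}\s])")
    return re.compile("|".join(parts))

OPERATOR_RE = build_operator_pattern()

# Comment pattern from the answer, unchanged.
COMMENT_RE = re.compile(r"[/][*](?:[^*]|[*](?![/]))*[*][/]")

lines = ["a = a+5", "a = a + 5", "a = a++"]
flagged = [line for line in lines if OPERATOR_RE.search(line)]
print(flagged)

# A multi-line comment can only be found in text that still contains the newlines.
source = "int a; /* first line\n * second line */ int b;"
print(COMMENT_RE.findall(source))
```

Because a multi-line comment spans several lines, the comment pattern has to be applied to the whole file contents (for example the result of f.read()) rather than to one list element at a time.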

The first issue can be solved with the help of negated character classes, at least for the current example strings and maybe some more cases. The problem you showed is due to the fact that \S matches any non-whitespace character. To match any char that is not whitespace, -, /, + and *, and maybe even ( and ), use [^\s+*/()-] negated character class. Your first regex - note that division operator should be /, not \ - can be written as
pat = r"[^\s+*/()-]([+*/-])[^\s+*/()-]"
See the regex demo
The second one is a solved issue.
pat = r"/\*[^*]*\*+(?:[^/*][^*]*\*+)*/"
See the regex demo.
Details
/\* - comment start
[^*]*\*+ - match 0+ characters other than * followed with 1+ literal *
(?:[^/*][^*]*\*+)* - 0+ sequences of:
[^/*][^*]*\*+ - not a / or * (matched with [^/*]) followed with 0+ non-asterisk characters ([^*]*) followed with 1+ asterisks (\*+)
/ - closing /
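A small sketch of both patterns in use, assuming the file has been read into a single string. The sample text and the line-number bookkeeping are illustrative additions, not something the answer prescribes.

```python
import re

OP_RE = re.compile(r"[^\s+*/()-]([+*/-])[^\s+*/()-]")
COMMENT_RE = re.compile(r"/\*[^*]*\*+(?:[^/*][^*]*\*+)*/")

text = "a = a+5\nb = b + 1\nc = c++\n/* note:\n * spans lines */\nd = d*2\n"

# Operators are checked line by line; group 1 holds the operator itself.
unspaced = []
for lineno, line in enumerate(text.splitlines(), start=1):
    m = OP_RE.search(line)
    if m:
        unspaced.append((lineno, m.group(1)))

# Comments are searched in the whole text so they may span several lines.
comments = [
    (text.count("\n", 0, m.start()) + 1, m.group())
    for m in COMMENT_RE.finditer(text)
]

print(unspaced)
print(comments)
```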

Related

Regex that matches newlines literally and passively

I have to construct a regex that matches client codes that look like:
XXX/X{3,6}
XXX.X{3,6}
XXX.X{3,6}/XXX
With X a number between 0 and 9.
The regex needs to be strong enough so we don't extract codes that are within another string. The use of word boundaries was my first idea.
The regex looks like this: \b\d{3}[\.\/]\d{3,6}(?:\/\d{3})?\b
The problem with word boundaries is that it also matches dots. So a number like "123/456.12" would match "123/456" as the client number. So then I came up with the following regex: (?<!\S)\d{3}[\.\/]\d{3,6}(?:\/\d{3})?(?!\S). It uses lookbehind and lookahead and checks if that character is a white space. This matches most of the client codes correctly.
But there is still one last issue. We are using a Google OCR text to extract the codes from. This means that a valid code can be found in the text like 123/456\n, \n123/456, \n123/456\n, etc. Checking if the previous and or next characters are white space doesn't work because the literal "\n" is not included in this. If I do something like (?<!\S|\\n) as the word boundary it also includes a back and/or forward slash for some reason. Currently I came up with the following regex (?<![^\r\n\t\f\v n])\d{3}[\.\/]\d{3,6}(?:\/\d{3})?(?![^\r\n\t\f\v \\]), but that only checks if the previous character is a "n" or white space and the next a backslash or white space. So strings like "lorem\123/456" would still find a match. I need some way to include the "\n" in the white space characters without breaking the lookahead/lookbehind.
Do you guys have any idea how to solve this issue? All input is appreciated. Thx!

It seems you want to subtract \n from the whitespace boundaries. You can use
re.findall(r'(?<![^\s\n])\d{3}[./]\d{3,6}(?:/\d{3})?(?![^\s\n])', text)
See the Python demo and this regex demo.
If the \n are combinations of \ and n chars, you need to make sure the \S in the lookarounds does not match those:
import re
text = r'Codes like 123/456\n \n123/3456 \n123/23456\n etc are correct \n333.3333/333\n'
print( re.findall(r'(?<!\S(?<!\\n))\d{3}[./]\d{3,6}(?:/\d{3})?(?!(?!\\n)\S)', text) )
# => ['123/456', '123/3456', '123/23456', '333.3333/333']
See this Python demo.
Details:
(?<![^\s\n]) - a negative lookbehind that matches a location that is not immediately preceded with a char other than whitespace and an LF char
(?<!\S(?<!\\n)) - a left whitespace boundary that does not trigger if the non-whitespace is the n from the \n char combination
\d{3} - three digits
[./] - a . or /
\d{3,6} - three to six digits
(?:/\d{3})? - an optional sequence of / and three digits
(?![^\s\n]) - a negative lookahead that requires no char other than whitespace and LF immediately to the right of the current location.
(?!(?!\\n)\S) - a right whitespace boundary that does not trigger if the non-whitespace is the \ char followed with n.
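To see how the second pattern behaves on the edge cases raised in the question, here is a minimal check. The name CODE_RE and the sample strings are assumptions made for illustration only.

```python
import re

# Pattern from the answer: whitespace boundaries that also accept a literal
# backslash + n pair on either side of the code.
CODE_RE = re.compile(r'(?<!\S(?<!\\n))\d{3}[./]\d{3,6}(?:/\d{3})?(?!(?!\\n)\S)')

# Each sample maps to the matches we expect from findall.
samples = {
    r'\n123/456\n': ['123/456'],               # wrapped in literal \n pairs
    r'lorem\123/456': [],                      # backslash not followed by n: rejected
    '123/456.12': [],                          # trailing ".12": rejected
    r'id 333.3333/333\n': ['333.3333/333'],    # optional "/XXX" suffix
}

for text, expected in samples.items():
    print(repr(text), CODE_RE.findall(text), expected)
```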

Regex to match following pattern in SQL query

I am trying to extract parts of a MySQL query to get the information I want.
I used this code / regex in Python:
import re
query = "SELECT `asd`.`ssss` as `column1`, `ss`.`wwwwwww` from `table`"
table_and_columns = re.findall('\`.*?`[.]\`.*?`',query)
My expected output:
['`asd`.`ssss`', '`ss`.`wwwwwww`']
My real output:
['`asd`.`ssss`', '`column1`, `ss`.`wwwwwww`']
Can anybody help me and explain me where I went wrong?
The regex should only find the ones that have two strings like asd and a dot in the middle.
PS: I know that this is not a valid query.

The dot . can also match a backtick, so the pattern starts by matching a backtick and is able to match all chars until it reaches the literal dot in [.]
There is no need to use non-greedy quantifiers; you can use a negated character class to prevent crossing the backtick boundary.
`[^`]*`\.`[^`]*`
The asterisk * matches 0 or more times. If there has to be at least a single char, and newlines and spaces are unwanted, you could add \s to the negated class to prevent matching whitespace chars and use + to match 1 or more times.
`[^`\s]+`\.`[^`\s]+`
For example
import re
query = "SELECT `asd`.`ssss` as `column1`, `ss`.`wwwwwww` from `table`"
table_and_columns = re.findall(r'`[^`\s]+`\.`[^`\s]+`', query)
print(table_and_columns)
Output
['`asd`.`ssss`', '`ss`.`wwwwwww`']
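For a side-by-side comparison, the sketch below runs the question's pattern and the negated-class pattern on the same query. The capture-group variant at the end is a hypothetical extension, in case the two identifiers are needed separately.

```python
import re

query = "SELECT `asd`.`ssss` as `column1`, `ss`.`wwwwwww` from `table`"

# Original attempt: '.' may match a backtick, so a match can span identifiers.
lazy = re.findall(r"`.*?`[.]`.*?`", query)

# Negated character class: a match can never cross a backtick.
strict = re.findall(r"`[^`\s]+`\.`[^`\s]+`", query)

# Hypothetical extension: capture groups split each hit into its two parts.
pairs = re.findall(r"`([^`\s]+)`\.`([^`\s]+)`", query)

print(lazy)
print(strict)
print(pairs)
```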

Please try the regex below. The dot in .*? can also match a backtick, which is what lets the match run across identifiers.
Instead, you should match [^`]*:
`[^`]*?`\.`[^`]*?`

The thing is that
.*? matches any character (except for line terminators), including whitespace.
Also, since * already allows zero or more occurrences, I'm not sure you need the ? here.
So this seems to work:
\`\S+\`[.]\`\S+\`
where \S is any non-whitespace character.
You can always check your regexes using https://regex101.com

Trying to get a substring using regex in Python / pandas

I know this may seem stupid but I've been looking everywhere and trying with regex and split in vain. My script never works for all type of string I have on my data set.
I have this column that contains raw data that look like (three cases):
20181223-FB-BOOST-AAAA-CC Auchy-Les-Mines - Père Noel
20161224-FB-BOOST-SSSS-CC LeMarine - XXX XXX
20161223-FB-BOOST-XXXX-CC Bonjour le monde - Blah blah
So what I want to do is to get the strings in the middle after CC and right before "-". I wrote a script that did work for the 2nd case but never the other two :
1st case: Auchy-Les-Mines
2nd case: LeMarine
3rd case: Bonjour le monde
Here is the regex that I used but never works for all cases: regex = r"\s\b.*-."
Thanks in advance !

You may use
df['Col'].str.extract(r'-CC\s+(.*?)\s+-')
If there can be line breaks between the two delimiters, add the s/dotall flag or use [\w\W]/[\s\S]/[\d\D] instead of a .:
df['Col'].str.extract(r'(?s)-CC\s+(.*?)\s+-')
#                       ^^^^
df['Col'].str.extract(r'-CC\s+([\w\W]*?)\s+-')
#                              ^^^^^^
See the regex demo.
Pattern details
-CC - a literal substring
\s+ - 1+ whitespaces
(.*?) - Group 1 (this value will be returned by .str.extract): any 0+ chars other than newline, as few as possible
\s+ - 1+ whitespaces (+ is important here)
- - a hyphen
The fact that there are \s+ patterns on both ends of (.*?) will make sure the result is already stripped from whitespace regardless of how many whitespaces there were.
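To try the pattern without a DataFrame, here is a quick sketch using the plain re module on the three sample rows from the question. The list name rows is just for illustration. Series.str.extract applies the same pattern per row and returns NaN where nothing matches.

```python
import re

rows = [
    "20181223-FB-BOOST-AAAA-CC Auchy-Les-Mines - Père Noel",
    "20161224-FB-BOOST-SSSS-CC LeMarine - XXX XXX",
    "20161223-FB-BOOST-XXXX-CC Bonjour le monde - Blah blah",
]

# Same pattern as in the answer: lazy group between "-CC" and a spaced hyphen.
pattern = re.compile(r"-CC\s+(.*?)\s+-")

extracted = []
for row in rows:
    m = pattern.search(row)
    extracted.append(m.group(1) if m else None)

print(extracted)
```

The hyphens inside "Auchy-Les-Mines" are not treated as the delimiter because the pattern requires whitespace before the closing hyphen.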

You can do it rather simply with:
result = df.raw_data.str.extract(r'-CC (.*) -')

Why does python's re.search method hang?

I'm using Python's regex library to parse some strings, and I found that either my regex is too complicated or the string I'm searching is too long.
Here's an example of the hang up:
>>> import re
>>> reg = "(\w+'?\s*)+[-|~]\s*((\d+\.?\d+\$?)|(\$?\d+\.?\d+))"
>>> re.search(reg, "**LOOKING FOR PAYPAL OFFERS ON THESE PAINTED UNCOMMONS**") #Hangs here...
I'm not sure what's going on. Any help appreciated!
EDIT: Here's a link with examples of what I'm trying to match: Regxr

The reason why the code execution hangs is catastrophic backtracking due to one obligatory and 1+ optional patterns (those that can match an empty string) inside a quantified group (\w+'?\s*)+ that allows a regex engine to test a lot of matching paths, so many that it takes too long to complete.
I suggest unwrapping the problematic group in such a way that ' or \s become obligatory and wrap them in an optional group:
(\w+(?:['\s]+\w+)*)\s*[-~]\s*(\$?\d+(?:\.\d+)?\$?)
^^^^^^^^^^^^^^^^^^^***
See the regex demo
Here, (\w+(?:['\s]+\w+)*) will match 1+ word chars, and then 0+ sequences of 1+ ' or whitespaces followed with 1+ word chars. This way, the pattern becomes linear and the regex engine fails the match quicker if a non-matching string occurs.
The rest of the pattern:
\s*[-~]\s* - either - or ~ wrapped with 0+ whitespaces
(\$?\d+(?:\.\d+)?\$?) - Group 2 capturing
\$? - 1 or 0 $ symbols
\d+ - 1+ digits
(?:\.\d+)? - 1 or 0 sequences of:
\. - a dot
\d+ - 1+ digits
\$? - 1 or 0 $ symbols
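A short sketch of the rewritten pattern in use. The two listing strings are made up for illustration, since the examples linked in the question are not available here.

```python
import re

FIXED = re.compile(r"(\w+(?:['\s]+\w+)*)\s*[-~]\s*(\$?\d+(?:\.\d+)?\$?)")

# The input that made the original pattern hang: the rewritten pattern
# gives up quickly because there is no "-" or "~" followed by a number.
hang_input = "**LOOKING FOR PAYPAL OFFERS ON THESE PAINTED UNCOMMONS**"
print(FIXED.search(hang_input))

# Made-up listings, only to show what the two groups capture.
results = []
for listing in ["Painted hat - $5.50", "Bob's key ~ 12$"]:
    m = FIXED.search(listing)
    results.append((m.group(1), m.group(2)))
print(results)
```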

Commenting Regular expressions in python

This answer to a question regarding the maintainability of regular expressions mentions the ability of .NET users to implement comments in their regular expressions (I am particularly interested in the second example)
Is there an easy native way to reproduce this in python, preferably without having to install a third party library or writing my own comment-strip algorithm?
What I currently do is similar to the first example in that answer: I build the regular expression across multiple lines and comment each line, as in the following example:
regexString = '(?:' # Non-capturing group matching the beginning of a comment
regexString += '/\*\*'
regexString += ')'

You're looking for the VERBOSE flag in the re module. Example from its documentation:
a = re.compile(r"""\d + # the integral part
\. # the decimal point
\d * # some fractional digits""", re.X)

r"""
(?: # Match the regular expression below
/ # Match the character “/” literally
\* # Match the character “*” literally
\* # Match the character “*” literally
)
"""
You can also add comments into regex like this:
(?#The following regex matches /** in a non-capture group :D)(?:/\*\*)
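Putting both approaches next to each other, here is a minimal runnable sketch. The names DOC_COMMENT_START and INLINE and the sample string are assumptions for illustration.

```python
import re

# Verbose form of the asker's pattern: whitespace and '#' comments
# inside the pattern are ignored by the regex engine.
DOC_COMMENT_START = re.compile(r"""
    (?:      # non-capturing group matching the beginning of a comment
        /    # a literal slash
        \*   # a literal asterisk
        \*   # a second literal asterisk
    )
""", re.VERBOSE)

# Inline (?#...) comments work without any flag.
INLINE = re.compile(r"(?#start of a doc comment)(?:/\*\*)")

sample = "/** Returns the answer. */"
print(DOC_COMMENT_START.match(sample).group())
print(INLINE.match(sample).group())
```

One thing to keep in mind with re.VERBOSE: a literal space or # in the pattern has to be escaped or placed inside a character class, otherwise it is ignored or treated as the start of a comment.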
