Match only non-quoted words using a regex in python - python

While trying to process some code, I needed to find the places where variables from a certain list were used. The problem is that the code is obfuscated, and those variable names can also appear inside a string, for example, which I don't want to match.
However, I haven't been able to find a regex that matches only non-quoted words and works in Python...

"[^\\\\]((\")|('))(?(2)([^\"]|\\\")*|([^']|\\')*)[^\\\\]\\1|(\w+)"
Should match any non-quoted word to the last group (6th group, index 5 with 0-based indexing). Minor modifications are required to avoid matching strings which begin with quotes.
Explanation:
[^\\\\] Match any character but an escape character. Escaped quotes do not start a string.
((\")|(')) Immediately after the non-escaped character, match either " or ', which starts a string. This is group 1, which contains groups 2 (\") and 3 (')
(?(2) if we matched group 2 (a double-quote)
([^\"]|\\\")*| match anything but double quotes, or match escaped double quotes. Otherwise:
([^']|\\')*) match anything but a single quote or match an escaped single quote.
If you wish to retrieve the string inside the quotes, you will have to add another group: (([^\"]|\\\")*) will allow you to retrieve the whole consumed string, rather than just the last matched character.
Note that the last character of a quoted string will actually be consumed by the last [^\\\\]. To retrieve it, you have to turn it into a group: ([^\\\\]). Additionally, the first character before the quote will also be consumed by [^\\\\], which might be meaningful in cases such as r"Raw\text".
[^\\\\]\\1 will match any non-escape character followed by what the first group matched again. That is, if ((\")|(')) matched a double quote, we require a double quote to end the string. Otherwise, it matched a single quote, which is what we require to end the string.
|(\w+) will match any word. This branch only matches non-quoted words, because quoted strings are consumed by the first alternative.
For example:
import re
# \\w is written with a doubled backslash so Python does not warn about an invalid escape sequence
non_quoted_words = "[^\\\\]((\")|('))(?(2)([^\"]|\\\")*|([^']|\\')*)[^\\\\]\\1|(\\w+)"
quote = "This \"is an example ' \\\" of \" some 'text \\\" like wtf' \\\" is what I said."
print(quote)
print(re.findall(non_quoted_words, quote))
will return:
This "is an example ' \" of " some 'text \" like wtf' \" is what I said.
[('', '', '', '', '', 'This'), ('"', '"', '', 'f', '', ''), ('', '', '', '', '', 'some'), ("'", '', "'", '', 't', ''), ('', '', '', '', '', 'is'), ('', '', '', '', '', 'what'), ('', '', '', '', '', 'I'), ('', '', '', '', '', 'said')]
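A simpler alternative to the conditional pattern above, offered as a sketch and not as the answer's own method: match quoted strings first and capture only the words that fall outside them. The names TOKEN and unquoted_words are illustrative. The sketch assumes that quotes are balanced and that a backslash only escapes characters inside a string.

```python
import re

# finditer scans left to right, so a quoted string is consumed whole
# before the word branch can see the words inside it.
TOKEN = re.compile(
    r'''"(?:\\.|[^"\\])*"'''   # double-quoted string, backslash escapes allowed
    r"""|'(?:\\.|[^'\\])*'"""  # single-quoted string, backslash escapes allowed
    r"""|(\w+)"""              # group 1: a word outside any quoted string
)

def unquoted_words(text):
    """Return the words in text that are not inside a quoted string."""
    return [m.group(1) for m in TOKEN.finditer(text) if m.group(1) is not None]

sample = 'x = "foo bar" + y + \'z\''
print(unquoted_words(sample))  # ['x', 'y']
```

Because only one group is ever of interest, there is no need to pick the right index out of a tuple of mostly empty groups.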

Related

How to find the length of the longest consecutive repetition of a certain repeating word in a string?

I'm trying to write a function that identifies how many times a certain word is repeated in its longest consecutive repetition.
I want the below function to print "5" because the word "hi" repeats 5 times in its most repetitive sequence inside of the string. How can I accomplish this?
import re
string = 'hihihihihibyebyebyehihihihibyebyebyehihi'
print(len(max(re.compile("(hi+hi)*").findall(string))))
Output: 4
I would recommend starting with the regex part. It probably isn't doing what you think it's doing.
Out of curiosity, I ran just a portion of the last line:
re.compile("(hi+hi)*").findall(string)
and the result was:
['hihi', '', '', '', '', '', '', '', '', '', '', '', 'hihi', '', '', '', '', '', '', '', '', '', 'hihi', '']
I can now see why the output was 4: the longest string in this list is 4 characters long.
This unexpected result brings up a few questions:
What patterns is the regex "(hi+hi)*" actually matching, and why?
What do all those empty strings mean?
The longest match was "hihi" which is only 2 hi's, but the output was 4. Why?
Take a closer look at the regex documentation. I think you'll find that the expression you are looking for is "(?:hi)+", which roughly means "at least one repetition of hi".
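To see what the original pattern is doing, it can help to look at a match object instead of the findall list. The sketch below (variable names are illustrative) shows that "hi+hi" means an h, one or more i, then "hi"; that the * lets the pattern match the empty string wherever no "hihi" starts; and that findall reports only the last repetition of the group, not the whole match.

```python
import re

string = 'hihihihihibyebyebyehihihihibyebyebyehihi'
pattern = re.compile("(hi+hi)*")

first = pattern.match(string)
print(first.group(0))  # whole match: 'hihihihi' (the group repeated twice)
print(first.group(1))  # last repetition of the group only: 'hihi'

# The * also allows zero repetitions, so the pattern matches the empty
# string at positions where "hihi" does not start.
empty_matches = [m for m in pattern.finditer(string) if m.group(0) == '']
print(len(empty_matches) > 0)  # True
```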
IIUC, you can use:
max(map(len, re.findall('(?:hi)+', string)))//len('hi')
Output: 5
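Alternatively, if the repeated unit can vary in length, capture both the unit chunk and the total match with its repeats. This divides by the length of the last unit matched, so it assumes every unit within a run has the same length: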
string = 'hxihxihxihxihxibyebyebyehihihihibyebyebyehihi'
max(len(a)//len(b) for a,b in re.findall('((hx?i)+)', string))
# 5
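The first approach can be wrapped in a small helper, sketched below. The function name longest_run is hypothetical; it assumes word is non-empty and uses re.escape so that characters with a special meaning in a regex are treated literally.

```python
import re

def longest_run(word, text):
    """Return the largest number of back-to-back repetitions of word in text."""
    # (?:...)+ repeats the word without capturing, so each match is one whole run
    runs = re.finditer('(?:' + re.escape(word) + ')+', text)
    return max((len(m.group()) // len(word) for m in runs), default=0)

string = 'hihihihihibyebyebyehihihihibyebyebyehihi'
print(longest_run('hi', string))   # 5
print(longest_run('bye', string))  # 3
print(longest_run('xyz', string))  # 0
```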

Python "split" function on repeated characters

I have gone through many threads on Stack Overflow about using the split function on strings, but I am still unclear about the following output:
"aaaaa".split("a")
output: ['', '', '', '', '', '']
"baaaaa".split("a")
output: ['b', '', '', '', '', '']
Can someone please explain how repeated characters are treated by "split" function?
The empty strings do not come from the repeated characters as such. They appear because your string consists almost entirely of the delimiter (the string by which you are splitting the target string). The output of str.split doesn't include the delimiter, so whatever lies between two adjacent delimiters, here nothing at all, becomes an empty string.
From the docs:
str.split(sep=None, maxsplit=-1)
[...]
If sep is given, consecutive delimiters are not grouped together and
are deemed to delimit empty strings (for example, '1,,2'.split(',')
returns ['1', '', '2']). The sep argument may consist of multiple
characters (for example, '1<>2<>3'.split('<>') returns ['1', '2', '3']). Splitting an empty string with a specified separator returns
[''].
Another way to see it:
The delimiter occurs five times in "aaaaa", so the string is cut in five places, which leaves six pieces.
The string contains nothing but delimiters, and the delimiters themselves are dropped, so every piece is empty:
['', '', '', '', '', '']
The second string works the same way, except that the first piece is 'b'.
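A quick check of the rule, as a sketch: a separator that occurs n times always yields n + 1 pieces, and the empty pieces can be filtered out afterwards if they are not wanted. The variable names are illustrative.

```python
for s in ("aaaaa", "baaaaa"):
    parts = s.split("a")
    # n occurrences of the separator always produce n + 1 pieces
    assert len(parts) == s.count("a") + 1
    print(parts)

# Drop the empty pieces if only the non-empty text matters
non_empty = [p for p in "baaaaa".split("a") if p]
print(non_empty)  # ['b']
```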

non-capturing parenthesis with lookbehind and lookahead - Python

So I want to capture the indices in a string like this:
"Something bad happened! # data[u'string_1'][u'string_2']['u2'][0]"
I want to capture the strings string_1, string_2, u2, and 0.
I was able to do this using the following regex:
re.findall("("
"((?<=\[u')|(?<=\['))" # Begins with [u' or ['
"[a-zA-Z0-9_\-]+" # Followed by any letters, numbers, _'s, or -'s
"(?='\])" # Ending with ']
")"
"|" # OR
"("
"(?<=\[)" # Begins with [
"[0-9]+" # Followed by any numbers
"(?=\])" # Ending with ]
")", message)
Problem is the result will include tuples with empty strings, as such:
[('string_1', '', ''), ('string_2', '', ''), ('u2', '', ''), ('', '', '0')]
Now I can easily filter out the empty strings from the result, but I would like to prevent them from appearing in the first place.
I believe this is caused by my capture groups. I tried to use ?: in those groups, but then my results were completely gone.
This is how I had attempted to do it:
re.findall("(?:"
"((?<=\[u')|(?<=\['))" # Begins with [u' or ['
"[a-zA-Z0-9_\-]+" # Followed by any letters, numbers, _'s, or -'s
"(?='\])" # Ending with ']
")"
"|" # OR
"(?:"
"(?<=\[)" # Begins with [
"[0-9]+" # Followed by any numbers
"(?=\])" # Ending with ]
")", message)
That results in the following output:
['', '', '', '']
I'm assuming the issue is due to me using lookbehinds along with the non-capturing groups. Any ideas on whether this is possible to do in Python?
Thanks
You can simplify your regex.
(?<=\[)u?'?([a-zA-Z0-9_\-]+)(?='?\])
See the demo: https://regex101.com/r/SA6shx/1
Regex: (?<=\[)(?:[^'\]]*')?([^'\]]+) or \[(?:[^'\]]*')?([^'\]]+)
Python code:
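import re
def Years(text):
    # After "[", skip an optional prefix up to the opening quote, then capture the key or index
    return re.findall(r'(?<=\[)(?:[^\'\]]*\')?([^\'\]]+)', text)
print(Years('Something bad happened! # data[u\'string_1\'][u\'string_2\'][\'u2\'][0]'))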
Output:
['string_1', 'string_2', 'u2', '0']
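For reference, the original two-branch pattern also works once every group is non-capturing, because re.findall returns the whole match when a pattern has no capturing groups. In the attempt from the question, the inner group holding the two lookbehinds was still capturing, and since lookbehinds consume no text it always captured an empty string. A sketch, assuming the same message string:

```python
import re

message = "Something bad happened! # data[u'string_1'][u'string_2']['u2'][0]"

pattern = re.compile(
    r"(?:(?<=\[u')|(?<=\['))"  # preceded by [u' or [' (non-capturing)
    r"[a-zA-Z0-9_\-]+"         # letters, digits, underscores or hyphens
    r"(?='\])"                 # followed by ']
    r"|"                       # OR
    r"(?<=\[)[0-9]+(?=\])"     # digits directly between [ and ]
)

print(pattern.findall(message))  # ['string_1', 'string_2', 'u2', '0']
```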

Find pattern in string using regex with python 3

I have string like below
string = "your invoice number IVR/20170531/XVII/V/12652967 and IVR/20170531/XVII/V/13652967"
I want to get invoice number IVR/20170531/XVII/V/12652967 and IVR/20170531/XVII/V/13652967 into list using regex with this pattern
result = re.findall(r'INV[/]\d{8}[/](M{1,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|M{0,4}(CM|C?D|D?C{1,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|X?L|L?X{1,3})(IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|I?V|V?I{1,3}))[/](M{1,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|M{0,4}(CM|C?D|D?C{1,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|X?L|L?X{1,3})(IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|I?V|V?I{1,3}))[/]\d{7,9}',string)
but the result is
[('XVII', '', '','', '', '', '', '', 'X', 'VII', '', '', '', 'V','','','', '', '', '', '', '', '', '', '', 'V')]
I tried this pattern in http://regexr.com/, the result is appropriately but in python not
You should modify your pattern: wrap the whole regular expression in parentheses and then read the matched text from the first group. Alternatively, group 0 of a match object always holds the entire match. You can read more about back-references here.
invoices = []
# Your pattern was slightly incorrect
pattern = re.compile(r'IVR[/]\d{8}[/](M{1,4}(CM|CD|D?C{0,3})|(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|M{0,4}(CM|C?D|D?C{1,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|X?L|L?X{1,3})(IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|I?V|V?I{1,3}))[/](M{1,4}(CM|CD|D?C{0,3})|(XC|XL|L?X{0,3})|(IX|IV|V?I{0,3})|M{0,4}(CM|C?D|D?C{1,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|X?L|L?X{1,3})(IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|I?V|V?I{1,3}))[/]\d{7,9}')
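# For each invoice pattern found in the string, append the whole match to the list
for invoice in pattern.finditer(string):
    # group(0) is the entire match; group(1) would be only the first Roman numeral here
    invoices.append(invoice.group(0))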
Note:
You should also use pattern.finditer(), because it lets you iterate through every match of the pattern in the text you called string. From the re.finditer documentation:
re.finditer(pattern, string, flags=0)
Return an iterator yielding
MatchObject instances over all non-overlapping matches for the RE
pattern in string. The string is scanned left-to-right, and matches
are returned in the order found. Empty matches are included in the
result unless they touch the beginning of another match.
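string = "your invoice number IVR/20170531/XVII/V/12652967 and IVR/20170531/XVII/V/13652967"
results = []
# regexpattern is assumed to hold the invoice pattern; it is not defined in this snippet
matches = re.finditer(regexpattern, string)
for match in matches:
    # match.group() with no argument returns the whole match
    results.append(match.group())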
You need to add ?: at the start of every group to make the groups non-capturing, so that re.findall returns the whole match instead of a tuple of groups.
Try with this regex:
IVR[/]\d{8}[/](?:M{0,4}(?:CM|CD|D?C{0,3})|(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3}))[/](?:M{0,4}(?:CM|CD|D?C{0,3})|(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3}))[/]\d{8}
Basically you need to add ?: for each group.
You can try this one to retrieve number, roman, roman and number values:
IVR\/(\d{8})\/(M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3}))\/(M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3}))\/(\d{7,9})
Snippet
import re
string = "your invoice number IVR/20170531/XVII/V/12652967 and IVR/20170531/XVII/V/13652967"
pattern = r"IVR\/(\d{8})\/(M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3}))\/(M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3}))\/(\d{7,9})"
for match in re.findall(pattern, string):
    print(match)
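Since the Roman-numeral fragment is repeated, the pattern can also be assembled from a named piece. This is a sketch with illustrative names (ROMAN, INVOICE); every group is non-capturing, so re.findall returns the full invoice numbers directly. Note that this fragment can also match an empty string, so the sketch does not check that a numeral is actually present.

```python
import re

# One Roman numeral up to 4999; all groups are non-capturing
ROMAN = r"M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})"
INVOICE = re.compile(r"IVR/\d{8}/" + ROMAN + "/" + ROMAN + r"/\d{7,9}")

string = "your invoice number IVR/20170531/XVII/V/12652967 and IVR/20170531/XVII/V/13652967"

# With no capturing groups, findall returns the whole matches as strings
invoices = INVOICE.findall(string)
print(invoices)
```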

is re.split("\W") = re.split("\w")?

As the title says, is re.split("\W") the same as re.split("\w")? The results I get are the same whichever I use, with or without a +. Is this right? Or does it only work in some cases, and if so, why? Thank you in advance.
They are not the same thing at all:
>>> test_string = 'hello world'
>>> import re
>>> re.split('\w', test_string)
['', '', '', '', '', ' ', '', '', '', '', '']
>>> re.split('\W', test_string)
['hello', 'world']
re.split does the following:
Split the source string by the occurrences of the pattern,
returning a list containing the resulting substrings.
\w and \W are:
\w Matches any alphanumeric character; equivalent to [a-zA-Z0-9_].
With LOCALE, it will match the set [0-9_] plus characters defined
as letters for the current locale.
\W Matches the complement of \w.
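To make the difference concrete, here is a short sketch on a string with punctuation (the sample text is made up). Raw strings are used so that the backslash reaches the regex engine unchanged.

```python
import re

text = "hello, world!"

# Each single non-word character is a separator, so ", " yields an empty piece
split_single = re.split(r"\W", text)
# A run of non-word characters counts as one separator
split_runs = re.split(r"\W+", text)
# Collect the words directly instead of splitting around them
words = re.findall(r"\w+", text)

print(split_single)  # ['hello', '', 'world', '']
print(split_runs)    # ['hello', 'world', '']
print(words)         # ['hello', 'world']
```

If only the words are needed, re.findall with \w+ avoids the empty strings altogether.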
