I'm trying to write a function that identifies how many times a certain word is repeated in its longest consecutive repetition.
I want the below function to print "5" because the word "hi" repeats 5 times in its most repetitive sequence inside of the string. How can I accomplish this?
import re
string = 'hihihihihibyebyebyehihihihibyebyebyehihi'
print(len(max(re.compile("(hi+hi)*").findall(string))))
Output: 4
I would recommend starting with the regex part. It probably isn't doing what you think it's doing.
Out of curiosity, I ran just a portion of the last line:
re.compile("(hi+hi)*").findall(string)
and the result was:
['hihi', '', '', '', '', '', '', '', '', '', '', '', 'hihi', '', '', '', '', '', '', '', '', '', 'hihi', '']
I can now see why the output was 4: the longest string in this list is 4 characters long.
This unexpected result brings up a few questions:
What patterns is the regex "(hi+hi)*" actually matching, and why?
What do all those empty strings mean?
The longest match was "hihi" which is only 2 hi's, but the output was 4. Why?
Try taking a closer look at the regex documentation. I think you'll find that the expression you are looking for is "(?:hi)+", which roughly means "at least one repetition of 'hi'".
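To make the questions above concrete, here is a small sketch of the two behaviours that probably caused the surprise. The shortened input "hihihi" and the list ["b", "aaa"] are made up for illustration; they are not from the question.

```python
import re

# With a capturing group, findall returns the group's text
# (its last repetition), not the whole match.
captured = re.findall(r"(hi)+", "hihihi")
print(captured)  # ['hi']

# With a non-capturing group, findall returns the whole match.
whole = re.findall(r"(?:hi)+", "hihihi")
print(whole)  # ['hihihi']

# max() on strings compares them lexicographically unless key=len is given.
words = ["b", "aaa"]
print(max(words))           # b
print(max(words, key=len))  # aaa
```

So in the original one-liner, max() picked the lexicographically largest captured string, and len() then counted its characters rather than the number of repetitions.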
IIUC, you can use:
max(map(len, re.findall('(?:hi)+', string)))//len('hi')
Output: 5
Alternatively, if the repeated unit can vary in length, capture both the unit chunk and the total match with repeats:
string = 'hxihxihxihxihxibyebyebyehihihihibyebyebyehihi'
max(len(a)//len(b) for a,b in re.findall('((hx?i)+)', string))
# 5
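If you need this for arbitrary words, the fixed-word approach can be wrapped in a helper. This is only a sketch: the name longest_run is made up here, and it assumes the word is a literal string (hence re.escape) and non-empty.

```python
import re

def longest_run(word, text):
    """Return the length, in repetitions, of the longest consecutive run of word."""
    if not word:
        raise ValueError("word must be non-empty")
    # re.escape treats the word literally, so 'a.b' does not match 'axb'.
    pattern = re.compile(f"(?:{re.escape(word)})+")
    # default=0 covers the case where the word does not occur at all.
    return max((len(run) // len(word) for run in pattern.findall(text)), default=0)

string = 'hihihihihibyebyebyehihihihibyebyebyehihi'
print(longest_run("hi", string))   # 5
print(longest_run("bye", string))  # 3
print(longest_run("yo", string))   # 0
```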
I have gone through many threads on Stack Overflow about using the split function on strings, but I am still unclear about the following output:
"aaaaa".split("a")
output: ['', '', '', '', '', '']
"baaaaa".split("a")
output: ['b', '', '', '', '', '']
Can someone please explain how repeated characters are treated by the split function?
The empty strings are not caused by the repeated characters as such, but by the fact that your string contains nothing except the delimiter (the string by which you are splitting the target string), apart from the leading 'b' in the second example. The output of str.split doesn't include the delimiter, so nothing is left between two adjacent delimiters.
From the docs:
str.split(sep=None, maxsplit=-1)
[...]
If sep is given, consecutive delimiters are not grouped together and
are deemed to delimit empty strings (for example, '1,,2'.split(',')
returns ['1', '', '2']). The sep argument may consist of multiple
characters (for example, '1<>2<>3'.split('<>') returns ['1', '2', '3']).
Splitting an empty string with a specified separator returns [''].
Another way to see it:
Cut your string at every delimiter. Five delimiters produce six pieces: one before the first "a", one between each adjacent pair, and one after the last:
"aaaaa" --> '' a '' a '' a '' a '' a ''
Exclude the delimiters and you will get:
['', '', '', '', '', '']
The second string works the same way, except that the first piece is 'b'.
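A quick way to check this yourself is shown below. It is a small sketch: the list comprehension that drops empty pieces and the whitespace example are additions for illustration, not something the question asked for.

```python
text = "baaaaa"
pieces = text.split("a")
print(pieces)  # ['b', '', '', '', '', '']

# n delimiters always produce n + 1 pieces.
print(len(pieces) == text.count("a") + 1)  # True

# Drop the empty pieces if you only want the non-empty ones.
non_empty = [p for p in pieces if p]
print(non_empty)  # ['b']

# With no separator argument, runs of whitespace ARE grouped together.
print("a  b   c".split())  # ['a', 'b', 'c']
```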
So I want to capture the indices in a string like this:
"Something bad happened! # data[u'string_1'][u'string_2']['u2'][0]"
I want to capture the strings string_1, string_2, u2, and 0.
I was able to do this using the following regex:
re.findall(r"("
    r"((?<=\[u')|(?<=\['))"  # Begins with [u' or ['
    r"[a-zA-Z0-9_\-]+"       # Followed by any letters, numbers, _'s, or -'s
    r"(?='\])"               # Ending with ']
    r")"
    r"|"                     # OR
    r"("
    r"(?<=\[)"               # Begins with [
    r"[0-9]+"                # Followed by any numbers
    r"(?=\])"                # Ending with ]
    r")", message)
Problem is the result will include tuples with empty strings, as such:
[('string_1', '', ''), ('string_2', '', ''), ('u2', '', ''), ('', '', '0')]
Now I can easily filter out the empty strings from the result, but I would like to prevent them from appearing in the first place.
I believe that the reason for this is my capture groups. I tried to use ?: in those groups, but then my results were completely gone.
This is how I had attempted to do it:
re.findall(r"(?:"
    r"((?<=\[u')|(?<=\['))"  # Begins with [u' or ['
    r"[a-zA-Z0-9_\-]+"       # Followed by any letters, numbers, _'s, or -'s
    r"(?='\])"               # Ending with ']
    r")"
    r"|"                     # OR
    r"(?:"
    r"(?<=\[)"               # Begins with [
    r"[0-9]+"                # Followed by any numbers
    r"(?=\])"                # Ending with ]
    r")", message)
That results in the following output:
['', '', '', '']
I'm assuming the issue is due to me using lookbehinds along with the non-capturing groups. Any ideas on whether this is possible to do in Python?
Thanks
You can simplify your regex.
(?<=\[)u?'?([a-zA-Z0-9_\-]+)(?='?\])
See the demo:
https://regex101.com/r/SA6shx/1
Regex: (?<=\[)(?:[^'\]]*')?([^'\]]+) or \[(?:[^'\]]*')?([^'\]]+)
Python code:
import re
def Years(text):
    # The optional prefix (?:[^'\]]*')? skips u' or '; the group captures the key itself
    return re.findall(r'(?<=\[)(?:[^\'\]]*\')?([^\'\]]+)', text)
print(Years('Something bad happened! # data[u\'string_1\'][u\'string_2\'][\'u2\'][0]'))
Output:
['string_1', 'string_2', 'u2', '0']
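As for why the ?: attempt returned only empty strings: one capture group was still left in it, the one wrapping the two lookbehinds, and when a pattern has exactly one group, re.findall returns that group's text. Lookbehinds are zero-width, so that text is always empty. The sketch below reproduces this and shows the same pattern with every group made non-capturing. The variable names one_group and no_groups are made up for illustration.

```python
import re

message = "Something bad happened! # data[u'string_1'][u'string_2']['u2'][0]"

# One capture group remains (around the lookbehinds), so findall returns
# that group, which is zero-width and therefore always ''.
one_group = r"(?:((?<=\[u')|(?<=\['))[a-zA-Z0-9_\-]+(?='\]))|(?:(?<=\[)[0-9]+(?=\]))"
print(re.findall(one_group, message))
# ['', '', '', '']

# No capture groups at all: findall returns each whole match.
no_groups = r"(?:(?:(?<=\[u')|(?<=\['))[a-zA-Z0-9_\-]+(?='\]))|(?:(?<=\[)[0-9]+(?=\]))"
print(re.findall(no_groups, message))
# ['string_1', 'string_2', 'u2', '0']
```

So lookbehinds and non-capturing groups do work together in Python; the original attempt only missed one group.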
I have a string like the one below:
string = "your invoice number IVR/20170531/XVII/V/12652967 and IVR/20170531/XVII/V/13652967"
I want to extract the invoice numbers IVR/20170531/XVII/V/12652967 and IVR/20170531/XVII/V/13652967 into a list using a regex with this pattern:
result = re.findall(r'INV[/]\d{8}[/](M{1,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|M{0,4}(CM|C?D|D?C{1,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|X?L|L?X{1,3})(IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|I?V|V?I{1,3}))[/](M{1,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|M{0,4}(CM|C?D|D?C{1,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|X?L|L?X{1,3})(IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|I?V|V?I{1,3}))[/]\d{7,9}',string)
but the result is
[('XVII', '', '','', '', '', '', '', 'X', 'VII', '', '', '', 'V','','','', '', '', '', '', '', '', '', '', 'V')]
I tried this pattern at http://regexr.com/ and the result is as expected, but in Python it is not.
You should modify your pattern: wrap the whole regular expression in ordinary parentheses and then read that text from the first group with match.group(1). Alternatively, leave the grouping as it is and read the entire match with match.group(0).
invoices = []
# Your pattern was slightly incorrect
pattern = re.compile(r'IVR[/]\d{8}[/](M{1,4}(CM|CD|D?C{0,3})|(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|M{0,4}(CM|C?D|D?C{1,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|X?L|L?X{1,3})(IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|I?V|V?I{1,3}))[/](M{1,4}(CM|CD|D?C{0,3})|(XC|XL|L?X{0,3})|(IX|IV|V?I{0,3})|M{0,4}(CM|C?D|D?C{1,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|X?L|L?X{1,3})(IX|IV|V?I{0,3})|M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|I?V|V?I{1,3}))[/]\d{7,9}')
# For each invoice pattern you find in string, append it to list
for invoice in pattern.finditer(string):
    # group(0) is the entire match; group(1) would be only the first Roman-numeral group
    invoices.append(invoice.group(0))
Note:
You should also use pattern.finditer(), because that way you can iterate through all matches of the pattern in the text you called string. From the re.finditer documentation:
re.finditer(pattern, string, flags=0)
Return an iterator yielding
MatchObject instances over all non-overlapping matches for the RE
pattern in string. The string is scanned left-to-right, and matches
are returned in the order found. Empty matches are included in the
result unless they touch the beginning of another match.
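A small sketch of what the match objects from finditer give you. To keep it short it uses a simplified stand-in pattern, with [A-Z]+ in place of the full Roman-numeral validation; that simplification and the names simple and found are assumptions made for this example.

```python
import re

text = "your invoice number IVR/20170531/XVII/V/12652967 and IVR/20170531/XVII/V/13652967"

# Simplified stand-in pattern (assumption): [A-Z]+ instead of Roman-numeral rules.
simple = re.compile(r"IVR/(\d{8})/[A-Z]+/[A-Z]+/\d{7,9}")

found = []
for m in simple.finditer(text):
    # group(0): whole match, group(1): first capture group, start(): offset in text
    found.append((m.group(0), m.group(1), m.start()))

for whole, date, offset in found:
    print(whole, date, offset)
```

Unlike findall, finditer always lets you reach the whole match through group(0), no matter how many capture groups the pattern contains.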
string = "your invoice number IVR/20170531/XVII/V/12652967 and IVR/20170531/XVII/V/13652967"
results = []
matches = re.finditer(regexpattern, string)  # regexpattern is your invoice pattern
for match in matches:
    results.append(match.group())  # group() with no argument returns the whole match
You need to add ?: before all the groups so that you can use non-capturing groups
Try with this regex:
IVR[/]\d{8}[/](?:M{0,4}(?:CM|CD|D?C{0,3})|(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3}))[/](?:M{0,4}(?:CM|CD|D?C{0,3})|(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3}))[/]\d{8}
Basically you need to add ?: for each group.
You can try this one to retrieve number, roman, roman and number values:
IVR\/(\d{8})\/(M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3}))\/(M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3}))\/(\d{7,9})
Snippet
import re
string = "your invoice number IVR/20170531/XVII/V/12652967 and IVR/20170531/XVII/V/13652967"
pattern = r"IVR\/(\d{8})\/(M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3}))\/(M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3}))\/(\d{7,9})"
for match in re.findall(pattern, string):
    print(match)
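If you want the complete invoice strings rather than their parts, one option is to build the pattern from a reusable non-capturing Roman-numeral fragment, so that re.findall has no groups to return and falls back to the whole match. This is a sketch: the names ROMAN and INVOICE are made up here, and note that this Roman-numeral fragment can also match an empty string, so it does not reject something like IVR/20170531///12652967.

```python
import re

# Roman numeral fragment with only non-capturing groups (it can match '').
ROMAN = r"M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})"

# Doubled braces are literal braces inside an f-string: \d{{8}} becomes \d{8}.
INVOICE = re.compile(rf"IVR/\d{{8}}/{ROMAN}/{ROMAN}/\d{{7,9}}")

string = "your invoice number IVR/20170531/XVII/V/12652967 and IVR/20170531/XVII/V/13652967"
invoices = INVOICE.findall(string)
print(invoices)
# ['IVR/20170531/XVII/V/12652967', 'IVR/20170531/XVII/V/13652967']
```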
As the title says, is re.split("\W") the same as re.split("\w")? The results I get are the same whichever I use, with or without +. Is this right? Or does it only work in some cases, and if so, why? Thank you in advance.
They are not the same thing at all:
>>> test_string = 'hello world'
>>> import re
>>> re.split(r'\w', test_string)
['', '', '', '', '', ' ', '', '', '', '', '']
>>> re.split(r'\W', test_string)
['hello', 'world']
re.split does the following:
Split the source string by the occurrences of the pattern,
returning a list containing the resulting substrings.
\w and \W are:
\w Matches any alphanumeric character; equivalent to [a-zA-Z0-9_].
With LOCALE, it will match the set [0-9_] plus characters defined
as letters for the current locale.
\W Matches the complement of \w.
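The + makes a visible difference as soon as two non-word characters sit next to each other. A short sketch, using a made-up sample string with a comma followed by a space:

```python
import re

s = "hello, world"

# \W splits on every single non-word character, so ', ' yields an empty piece.
by_nonword = re.split(r"\W", s)
print(by_nonword)  # ['hello', '', 'world']

# \W+ treats a run of non-word characters as one delimiter.
by_nonword_run = re.split(r"\W+", s)
print(by_nonword_run)  # ['hello', 'world']

# \w+ does the opposite: the words are the delimiters and the punctuation remains.
by_word_run = re.split(r"\w+", s)
print(by_word_run)  # ['', ', ', '']
```

If \w and \W gave you identical results, the input most likely contained neither kind of character in a position where it mattered. Note also that in Python 3, with str patterns, \w matches Unicode letters and digits as well unless the re.ASCII flag is set.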