Python "split" function on repeated characters - python

I have gone through many threads on Stack Overflow about using the split function on strings, but I am still unclear about the following output:
"aaaaa".split("a")
output: ['', '', '', '', '', '']
"baaaaa".split("a")
output: ['b', '', '', '', '', '']
Can someone please explain how repeated characters are treated by the "split" function?

The empty strings are not due to the fact that you have repeated characters but that in your string you have only the delimiter (the string by which you are splitting the target string). The output of str.split doesn't include the delimiter.
From the docs:
str.split(sep=None, maxsplit=-1)
[...]
If sep is given, consecutive delimiters are not grouped together and
are deemed to delimit empty strings (for example, '1,,2'.split(',')
returns ['1', '', '2']). The sep argument may consist of multiple
characters (for example, '1<>2<>3'.split('<>') returns ['1', '2',
'3']). Splitting an empty string with a specified separator returns
[''].
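Another way to see it:
Every occurrence of the delimiter is a cut point, so a string that contains five delimiters is cut into six pieces:
"aaaaa" --> five 'a' delimiters, six pieces
The delimiters themselves are excluded from the output, and since nothing sits before, between, or after them, every piece is an empty string:
['', '', '', '', '', '']
The second string works the same way, except that the piece before the first delimiter is 'b'.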
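The rule can be checked directly: a string containing n delimiters always splits into n + 1 pieces. A minimal sketch (the sample strings and variable names are only illustrative):

```python
# Every occurrence of the separator is a cut point, so n separators
# always produce n + 1 pieces, some of which may be empty strings.
for text in ["aaaaa", "baaaaa", "banana"]:
    pieces = text.split("a")
    assert len(pieces) == text.count("a") + 1
    print(text, pieces)

# If the empty strings are not wanted, filter them out afterwards.
non_empty = [piece for piece in "baaaaa".split("a") if piece]
print(non_empty)  # ['b']
```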

Related

How to find the length of the longest consecutive repetition of a certain repeating word in a string?

I'm trying to write a function that identifies how many times a certain word is repeated in its longest consecutive repetition.
I want the below function to print "5" because the word "hi" repeats 5 times in its most repetitive sequence inside of the string. How can I accomplish this?
import re
string = 'hihihihihibyebyebyehihihihibyebyebyehihi'
print(len(max(re.compile("(hi+hi)*").findall(string))))
Output: 4
I would recommend starting with the regex part. It probably isn't doing what you think it's doing.
Out of curiosity, I ran just a portion of the last line:
re.compile("(hi+hi)*").findall(string)
and the result was:
['hihi', '', '', '', '', '', '', '', '', '', '', '', 'hihi', '', '', '', '', '', '', '', '', '', 'hihi', '']
I can now see why the output was 4: the longest string in this list is 4 characters long.
This unexpected result brings up a few questions:
What patterns is the regex "(hi+hi)*" actually matching, and why?
What do all those empty strings mean?
The longest match was "hihi" which is only 2 hi's, but the output was 4. Why?
Try taking a closer look at the documentation for regex. I think you'll find that the expression you are looking for is "(?:hi)+", which roughly means "at least one repetition of hi".
IIUC, you can use:
max(map(len, re.findall('(?:hi)+', string)))//len('hi')
Output: 5
Alternatively, if the repeated unit can vary in length, capture both the unit chunk and the total match with repeats:
string = 'hxihxihxihxihxibyebyebyehihihihibyebyebyehihi'
max(len(a)//len(b) for a,b in re.findall('((hx?i)+)', string))
# 5
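For a fixed word, the same idea can be wrapped in a small reusable function. This is only a sketch: the name longest_run is hypothetical, and re.escape is added on the assumption that the word might contain regex metacharacters.

```python
import re

def longest_run(text, word):
    """Return the largest number of back-to-back repetitions of word in text."""
    if not word:
        raise ValueError("word must be a non-empty string")
    # re.escape keeps characters such as '.' or '+' in word literal.
    pattern = "(?:" + re.escape(word) + ")+"
    runs = re.findall(pattern, text)
    # default=0 covers the case where word never occurs.
    return max((len(run) // len(word) for run in runs), default=0)

string = 'hihihihihibyebyebyehihihihibyebyebyehihi'
print(longest_run(string, 'hi'))   # 5
print(longest_run(string, 'bye'))  # 3
print(longest_run(string, 'xyz'))  # 0
```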

Why is there whitespace when using .split() on string with the split term in consecutive order?

I noticed that when I did "heelo".split("e"), it returned ['h', '', 'lo']. Why is there an empty (or whitespace) item in the list? Shouldn't it have been ['h', 'lo']?
I am confused about why I received that result instead of what I had expected, and I would appreciate it if someone could explain the functionality of split to me.
From the Python docs:
If sep is given, consecutive delimiters are not grouped together and are deemed to delimit empty strings (for example, '1,,2'.split(',') returns ['1', '', '2'])
Your string is split at the first e and again at the second e, but there is no character between them, so you get an empty string '' back.
The first 'e' ends the piece 'h'. The character right after it is also an 'e', and since nothing lies between the two delimiters, the piece between them is an empty string.
If we add one more 'e':
"heeelo".split("e")
['h', '', '', 'lo']
It returns two empty strings between the three 'e's.
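If the expected result ['h', 'lo'] is what you actually want, there are two common ways to get it. This is a sketch using only str.split and re.split; the variable names are illustrative.

```python
import re

word = "heeelo"

# Option 1: split as usual, then drop the empty strings.
filtered = [part for part in word.split("e") if part]
print(filtered)  # ['h', 'lo']

# Option 2: treat a run of one or more 'e' as a single separator.
merged = re.split("e+", word)
print(merged)  # ['h', 'lo']
```

Note that option 2 still yields an empty string when the text starts or ends with the separator, so option 1 is the safer choice if no empty items may appear at all.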

Python Regex to check for characters problem, what am I doing wrong? [duplicate]

Being new to Regex, I am working on a project that allows me to check if a password contains lowercase characters, uppercase and numerical ones.
Here is the code:
import re
text = "azeAZE123"
compilealpha = re.compile(r'[a-z]*')
compileAlpha = re.compile(r'[A-Z]*')
compilenum = re.compile(r'[0-9]*')
checkalpha = compilealpha.findall(text)
checkAlpha = compileAlpha.findall(text)
checknum = compilenum.findall(text)
print(checkAlpha)
print(checkalpha)
print(checknum)
What I do not understand is that I get an output like this one:
['', '', '', 'AZE', '', '', '', '']
['aze', '', '', '', '', '', '', '']
['', '', '', '', '', '', '123', '']
Could anyone explain to me what happened and what am I doing wrong please?
Your regular expressions use the quantifier *, which means 0 or more repetitions. Because zero repetitions is a valid match, findall() also reports an empty match at every position where the character class does not match.
If you want to check if the regular expression has one or more matches, use r'[A-Z]+' instead.
Since you are probably more interested in whether there is a match than in what the match is, consider using the search() function instead of findall(). It returns a match object when the pattern is found and None otherwise, so its result can be tested directly in a condition.
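A minimal sketch of that approach for the password check from the question (the function name has_required_classes is hypothetical):

```python
import re

def has_required_classes(password):
    """True if password has a lowercase letter, an uppercase letter and a digit."""
    checks = (r'[a-z]', r'[A-Z]', r'[0-9]')
    # re.search returns a match object or None; one character per class is enough.
    return all(re.search(pattern, password) is not None for pattern in checks)

print(has_required_classes("azeAZE123"))  # True
print(has_required_classes("azeaze123"))  # False
```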

Match only non-quoted words using a regex in python

While trying to process some code, I needed to find instances in which variables from a certain list were used. Problem is, the code is obfuscated and those variable names could also appear in a string, for example, which I didn't want to match.
However, I haven't been able to find a regex to match only non-quoted words that works in python...
"[^\\\\]((\")|('))(?(2)([^\"]|\\\")*|([^']|\\')*)[^\\\\]\\1|(\w+)"
Should match any non-quoted word to the last group (6th group, index 5 with 0-based indexing). Minor modifications are required to avoid matching strings which begin with quotes.
Explanation:
[^\\\\] Match any character but an escape character. Escaped quotes do not start a string.
((\")|(')) Immediately after the non-escaped character, match either " or ', which starts a string. This is group 1, which contains groups 2 (\") and 3 (')
(?(2) if we matched group 2 (a double-quote)
([^\"]|\\\")*| match anything but double quotes, or match escaped double quotes. Otherwise:
([^']|\\')*) match anything but a single quote or match an escaped single quote.
If you wish to retrieve the string inside the quotes, you will have to add another group: (([^\"]|\\\")*) will allow you to retrieve the whole consumed string, rather than just the last matched character.
Note that the last character of a quoted string will actually be consumed by the last [^\\\\]. To retrieve it, you have to turn it into a group: ([^\\\\]). Additionally, the first character before the quote will also be consumed by [^\\\\], which might be meaningful in cases such as r"Raw\text".
[^\\\\]\\1 will match any non-escape character followed by what the first group matched again. That is, if ((\")|(')) matched a double quote, we require a double quote to end the string. Otherwise, it matched a single quote, which is what we require to end the string.
|(\w+) will match any word. This will only match non-quoted words, as quoted strings will be consumed by the first alternative.
For example:
import re
# "\\w" is spelled with a doubled backslash because "\w" is an invalid escape in a non-raw string
non_quoted_words = "[^\\\\]((\")|('))(?(2)([^\"]|\\\")*|([^']|\\')*)[^\\\\]\\1|(\\w+)"
quote = "This \"is an example ' \\\" of \" some 'text \\\" like wtf' \\\" is what I said."
print(quote)
print(re.findall(non_quoted_words,quote))
will return:
This "is an example ' \" of " some 'text \" like wtf' \" is what I said.
[('', '', '', '', '', 'This'), ('"', '"', '', 'f', '', ''), ('', '', '', '', '', 'some'), ("'", '', "'", '', 't', ''), ('', '', '', '', '', 'is'), ('', '', '', '', '', 'what'), ('', '', '', '', '', 'I'), ('', '', '', '', '', 'said')]
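Since only the last group is of interest, the tuples can be reduced to a plain list of words. The sketch below reuses the pattern from above unchanged (apart from writing \\w) and is only as reliable as that pattern; the helper name non_quoted_words_in and the sample input are hypothetical.

```python
import re

# Pattern copied from the answer above; group 6 (index 5) holds non-quoted words.
NON_QUOTED_WORDS = "[^\\\\]((\")|('))(?(2)([^\"]|\\\")*|([^']|\\')*)[^\\\\]\\1|(\\w+)"

def non_quoted_words_in(code):
    """Return the words that the pattern matched outside of quoted strings."""
    # findall yields one 6-tuple per match; index 5 is empty for quoted strings.
    return [groups[5] for groups in re.findall(NON_QUOTED_WORDS, code) if groups[5]]

print(non_quoted_words_in('x = "foo" + bar'))  # ['x', 'bar']
```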

is python str.split() inconsistent?

>>> ".a string".split('.')
['', 'a string']
>>> "a .string".split('.')
['a ', 'string']
>>> "a string.".split('.')
['a string', '']
>>> "a ... string".split('.')
['a ', '', '', ' string']
>>> "a ..string".split('.')
['a ', '', 'string']
>>> 'this  is a test'.split(' ')
['this', '', 'is', 'a', 'test']
>>> 'this  is a test'.split()
['this', 'is', 'a', 'test']
Why is split() different from split(' ') when the string contains only spaces as whitespace?
Why does split('.') turn the "..." in "a ... string" into two empty strings ('', '')? split() does not produce an empty word between two separators...
The docs are clear about this (see @agf's answer below), but I'd like to know why this behaviour was chosen.
I have looked in the source code (here) and thought line 136 should be just less than: ...i < str_len...
See the str.split docs, this behavior is specifically mentioned:
If sep is given, consecutive delimiters are not grouped together and
are deemed to delimit empty strings (for example, '1,,2'.split(',')
returns ['1', '', '2']). The sep argument may consist of multiple
characters (for example, '1<>2<>3'.split('<>') returns ['1', '2',
'3']). Splitting an empty string with a specified separator returns
[''].
If sep is not specified or is None, a different splitting algorithm is
applied: runs of consecutive whitespace are regarded as a single
separator, and the result will contain no empty strings at the start
or end if the string has leading or trailing whitespace. Consequently,
splitting an empty string or a string consisting of just whitespace
with a None separator returns [].
Python tries to do what you would expect. Most people not thinking too hard would probably expect
'1 2 3 4 '.split()
to return
['1', '2', '3', '4']
Think about splitting data where spaces have been used instead of tabs to create fixed-width columns: if the values have different widths, there will be a different number of spaces in each row.
There is often trailing whitespace at the end of a line that you can't see, and the default ignores it as well -- it gives you the answer you'd visually expect.
When it comes to the algorithm used when a delimiter is specified, think about a row in a CSV file:
1,,3
means there is data in the 1st and 3rd columns, and none in the second, so you would want
'1,,3'.split(',')
to return
['1', '', '3']
otherwise you wouldn't be able to tell what column each string came from.
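Both behaviours can be seen side by side in a short sketch (the sample rows are made up for illustration):

```python
# Space-padded columns: the default split() collapses runs of whitespace
# and ignores leading/trailing whitespace.
row = "1  22   333 "
print(row.split())     # ['1', '22', '333']

# An explicit separator keeps every gap, so empty strings appear.
print(row.split(" "))  # ['1', '', '22', '', '', '333', '']

# For delimited data this is what keeps the columns aligned.
csv_row = "1,,3"
columns = csv_row.split(",")
print(columns)     # ['1', '', '3']
print(columns[2])  # 3, still the third column
```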
