is python str.split() inconsistent? - python

>>> ".a string".split('.')
['', 'a string']
>>> "a .string".split('.')
['a ', 'string']
>>> "a string.".split('.')
['a string', '']
>>> "a ... string".split('.')
['a ', '', '', ' string']
>>> "a ..string".split('.')
['a ', '', 'string']
>>> 'this  is a test'.split(' ')
['this', '', 'is', 'a', 'test']
>>> 'this  is a test'.split()
['this', 'is', 'a', 'test']
Why is split() different from split(' ') when the string contains only spaces as whitespace?
Why does split('.') turn the "..." in "a ... string" into two empty strings ('', '')? split() with no argument does not report an empty word between two separators...
The docs are clear about this (see @agf's answer below), but I'd like to know why this behaviour was chosen.
I have looked at the source code (here) and thought the comparison on line 136 should be a plain less-than: ...i < str_len...

See the str.split docs, this behavior is specifically mentioned:
If sep is given, consecutive delimiters are not grouped together and
are deemed to delimit empty strings (for example, '1,,2'.split(',')
returns ['1', '', '2']). The sep argument may consist of multiple
characters (for example, '1<>2<>3'.split('<>') returns ['1', '2',
'3']). Splitting an empty string with a specified separator returns
[''].
If sep is not specified or is None, a different splitting algorithm is
applied: runs of consecutive whitespace are regarded as a single
separator, and the result will contain no empty strings at the start
or end if the string has leading or trailing whitespace. Consequently,
splitting an empty string or a string consisting of just whitespace
with a None separator returns [].
Python tries to do what you would expect. Most people not thinking too hard would probably expect
'1 2 3 4 '.split()
to return
['1', '2', '3', '4']
Think about splitting data where spaces have been used instead of tabs to create fixed-width columns: if the values have different widths, there will be a different number of spaces in each row.
There is often trailing whitespace at the end of a line that you can't see, and the default ignores it as well -- it gives you the answer you'd visually expect.
When it comes to the algorithm used when a delimiter is specified, think about a row in a CSV file:
1,,3
means there is data in the 1st and 3rd columns, and none in the second, so you would want
'1,,3'.split(',')
to return
['1', '', '3']
otherwise you wouldn't be able to tell what column each string came from.
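As a rough sketch of the two use cases described above (the sample rows are made up for illustration, they are not from the question):

```python
# Hypothetical sample data: a fixed-width row padded with spaces,
# and a CSV row whose second column is empty.
fixed_width_row = "1    22   333  "
csv_row = "1,,3"

# No separator: runs of whitespace collapse into one separator and
# leading/trailing whitespace is ignored.
columns = fixed_width_row.split()

# Explicit separator: every delimiter counts, so the empty column
# keeps its position in the result.
fields = csv_row.split(",")

print(columns)  # ['1', '22', '333']
print(fields)   # ['1', '', '3']
```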

Related

Why is there whitespace when using .split() on string with the split term in consecutive order?

I noticed that when I did "heelo".split("e"), it returned ['h', '', 'lo']. Why is there an empty (or whitespace) item in the list? Shouldn't it have been ['h', 'lo']?
I am confused about why I got that result instead of what I expected, and would appreciate it if someone could explain how split works.
From the Python docs:
If sep is given, consecutive delimiters are not grouped together and are deemed to delimit empty strings (for example, '1,,2'.split(',') returns ['1', '', '2'])
Your string is divided between the first e and the second e, but there is no character there, so you get an empty string '' back.
The first 'e' separates off the 'h', and the character right after it is also an 'e'. Since there is nothing between the first and second 'e', you get an empty string.
If we add one more 'e':
"heeelo".split("e")
['h', '', '', 'lo']
It returns two empty strings between the three 'e's.
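If the empty strings are not wanted, one common option (a sketch, not the only way) is to filter them out after splitting:

```python
word = "heeelo"

# Three adjacent delimiters leave two empty strings between them.
parts = word.split("e")
print(parts)  # ['h', '', '', 'lo']

# Empty strings are falsy, so a simple filter drops them.
non_empty = [piece for piece in parts if piece]
print(non_empty)  # ['h', 'lo']
```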

Python "split" function on repeated characters

I have gone through many threads on Stack Overflow about using the split function on strings, but I am still unclear about the following output:
"aaaaa".split("a")
output: ['', '', '', '', '', '']
"baaaaa".split("a")
output: ['b', '', '', '', '', '']
Can someone please explain how repeated characters are treated by "split" function?
The empty strings are not due to the fact that you have repeated characters but that in your string you have only the delimiter (the string by which you are splitting the target string). The output of str.split doesn't include the delimiter.
From the docs:
str.split(sep=None, maxsplit=-1)
[...]
If sep is given, consecutive delimiters are not grouped together and
are deemed to delimit empty strings (for example, '1,,2'.split(',')
returns ['1', '', '2']). The sep argument may consist of multiple
characters (for example, '1<>2<>3'.split('<>') returns ['1', '2',
'3']). Splitting an empty string with a specified separator returns
[''].
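Another way to see it:
Separate your string at each delimiter; you will get:
"aaaaa" --> ['a', 'a', 'a', 'a', 'a']
Exclude the delimiter from each piece and you will get:
['', '', '', '', '']
plus one more empty string for the piece after the last delimiter, because n delimiters always separate n + 1 pieces. That gives the six empty strings in your output. Your second string works the same way, except that the first piece is 'b'.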
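One way to check the counting is to compare the number of pieces with the number of delimiters. A small sketch using the two strings from the question:

```python
results = {}
for s in ("aaaaa", "baaaaa"):
    parts = s.split("a")
    # With an explicit separator, there is always one more piece than
    # there are delimiters, and joining the pieces restores the string.
    assert len(parts) == s.count("a") + 1
    assert "a".join(parts) == s
    results[s] = parts

print(results["aaaaa"])   # ['', '', '', '', '', '']
print(results["baaaaa"])  # ['b', '', '', '', '', '']
```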

Why does split() return more elements than split(" ") on same string?

I am using split() and split(" ") on the same string. Why does split(" ") return fewer elements than split()? I want to know in what specific input case this would happen.
str.split with the None argument (or, no argument) splits on all whitespace characters, and this isn't limited to just the space you type in using your spacebar.
In [457]: text = 'this\nshould\rhelp\tyou\funderstand'
In [458]: text.split()
Out[458]: ['this', 'should', 'help', 'you', 'understand']
In [459]: text.split(' ')
Out[459]: ['this\nshould\rhelp\tyou\x0cunderstand']
A list of all the whitespace characters that split(None) splits on can be found at All the Whitespace Characters? Is it language independent?
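As a quick check, assuming Python 3 str objects, the characters that split() treats as separators should line up with those for which str.isspace() is true. The candidate list below is just an illustrative sample, and splits_on is a helper name made up for this sketch:

```python
# Common whitespace, a no-break space, and two ordinary characters
# included for comparison.
candidates = [" ", "\t", "\n", "\r", "\f", "\v", "\u00a0", ".", "_"]

def splits_on(ch):
    """Return True if split() with no argument treats ch as a separator."""
    return ("a" + ch + "b").split() == ["a", "b"]

for ch in candidates:
    # Print the character, whether it is whitespace, and whether split() uses it.
    print(repr(ch), ch.isspace(), splits_on(ch))
```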
If you run the help command on the split() function you'll see this:
split(...)
S.split([sep [,maxsplit]]) -> list of strings
Return a list of the words in the string S, using sep as the delimiter
string. If maxsplit is given, at most maxsplit splits are done. If sep
is not specified or is None, any whitespace string is a separator and
empty strings are removed from the result.
Therefore the difference between the two is that split() without a specified delimiter removes the empty strings, while the one with the delimiter doesn't.
The method str.split called without arguments has a somewhat different behaviour.
First it splits by any whitespace character.
'foo bar\nbaz\tmeh'.split() # ['foo', 'bar', 'baz', 'meh']
But it also removes the empty strings from the output list.
' foo bar '.split(' ') # ['', 'foo', 'bar', '']
' foo bar '.split() # ['foo', 'bar']
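To see that both differences matter, here is a small sketch comparing split() with "split on a space, then drop the empty strings". The helper name and sample strings are made up for illustration:

```python
def split_on_space_drop_empty(s):
    """Split on single spaces and discard the empty pieces."""
    return [piece for piece in s.split(" ") if piece]

spaces_only = " foo  bar "
mixed = "foo bar\nbaz\tmeh"

# When spaces are the only whitespace, the two approaches agree.
print(split_on_space_drop_empty(spaces_only))  # ['foo', 'bar']
print(spaces_only.split())                     # ['foo', 'bar']

# With other whitespace characters they differ, because split(' ')
# only ever splits on the space character.
print(split_on_space_drop_empty(mixed))  # ['foo', 'bar\nbaz\tmeh']
print(mixed.split())                     # ['foo', 'bar', 'baz', 'meh']
```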
In Python, the split function splits on a specific string if one is given, otherwise on whitespace (and you can then access the resulting list by index as usual):
s = "Hello world! How are you?"
s.split()
Out[9]: ['Hello', 'world!', 'How', 'are', 'you?']
s.split("!")
Out[10]: ['Hello world', ' How are you?']
s.split("!")[0]
Out[11]: 'Hello world'
In my own experience, most of the confusion comes from how split() treats whitespace.
Passing a separator such as ' ' rather than None triggers different behavior in split(). According to the Python documentation:
If sep is not specified or is None, a different splitting algorithm is applied: runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace.
Below is an example in which the sample string has a trailing space ' ', the same whitespace character that is passed to the second split(). So the method behaves differently not because of some whitespace character mismatch, but because of how it was designed to work, perhaps for convenience in common scenarios. It can still be confusing for people who expect split() to just split.
>>> sample = "a b "
>>> sample.split()
['a', 'b']
>>> sample.split(' ')
['a', 'b', '']

Reading in file with different number of spaces as delimiter?

I'm trying to read in a file but it's looking really awkward because each of the spaces between columns is different. This is what I have so far:
with open('sextractordata1488.csv') as f:
    #getting rid of title, aka unusable lines:
    for _ in xrange(15):
        next(f)
    for line in f:
        cols = line.split(' ')
        #9 because it's 9 spaces before the first column with real data
        print cols[10]
I looked up how to do this and saw tr and sed commands, but they gave syntax errors when I tried them, and I wasn't sure where in the code to put them (in the for loop or before it?). I want to reduce all the spaces between columns to one space so that I can reliably get a single column. At the moment, because it's a counter column running from 1 to 101, I only get 10 through 99, along with a bunch of spaces and parts of other columns in between: 1 and 101 have a different number of characters, and therefore a different number of spaces from the beginning of the line.
Just use str.split() without an argument. The string is then split on arbitrary width whitespace. That means it doesn't matter how many spaces there are between non-whitespace content anymore:
>>> ' this is rather \t\t hard to parse without\thelp\n'.split()
['this', 'is', 'rather', 'hard', 'to', 'parse', 'without', 'help']
Note that leading and trailing whitespace are removed as well. Tabs, spaces, newlines, and carriage returns are all considered whitespace.
For completeness' sake, the first argument can also be set to None for the same effect. This is helpful to know when you need to limit the split with the second argument:
>>> ' this is rather \t\t hard to parse without\thelp\n'.split(None)
['this', 'is', 'rather', 'hard', 'to', 'parse', 'without', 'help']
>>> ' this is rather \t\t hard to parse without\thelp\n'.split(None, 3)
['this', 'is', 'rather', 'hard to parse without\thelp\n']
cols = line.split() should be sufficient
>>> "a b".split()
['a', 'b']
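Putting this together for the file-reading case, a minimal sketch might look like the following. It uses an in-memory stand-in for the file with made-up contents and only two header lines (the question's file has 15), and it assumes Python 3:

```python
import io
from itertools import islice

# Hypothetical stand-in for the asker's file: header lines followed by
# rows whose columns are separated by a varying number of spaces.
sample = io.StringIO(
    "# header line 1\n"
    "# header line 2\n"
    "   1   10.5  0.3\n"
    "  10   11.0  0.4\n"
    " 101    9.8  0.2\n"
)
HEADER_LINES = 2  # assumption for this sketch; the question skips 15

first_column = []
for line in islice(sample, HEADER_LINES, None):
    cols = line.split()  # any run of whitespace is one separator
    if cols:             # skip blank lines
        first_column.append(cols[0])

print(first_column)  # ['1', '10', '101']
```

With a real file, the io.StringIO object would be replaced by the handle returned from open().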

Splitting strings separated by multiple possible characters?

...note that values will be delimited by one or more space or TAB characters
How can I use the split() method if there are multiple separating characters of different types, as in this case?
By default, split() can handle multiple types of whitespace. I'm not sure if it's enough for what you need, but try it:
>>> s = "a \tb c\t\t\td"
>>> s.split()
['a', 'b', 'c', 'd']
It certainly works for multiple spaces and tabs mixed.
Split using regular expressions and not just one separator:
http://docs.python.org/2/library/re.html
I had the same problem with some strings separated by different whitespace characters, and used \s as described in the regular expression library documentation.
\s matches any whitespace character; this is equivalent to the set [ \t\n\r\f\v].
You will need to import re, the regular expression module:
import re
line = "something separated\t by \t\t\t different \t things"
workstr = re.sub(r'\s+', '\t', line)
So any run of whitespace (\s) repeated one or more times (+) is replaced by a single tab (\t), which you can then process with split('\t'):
# workstr is now "something\tseparated\tby\tdifferent\tthings"
newline = workstr.split('\t')
# newline is now ['something', 'separated', 'by', 'different', 'things']
Do a text substitution first then split.
e.g. replace all tabs with spaces, then split on space.
You can use regular expressions first:
import re
re.sub(r'\s+', ' ', 'text with whitespace etc').split()
['text', 'with', 'whitespace', 'etc']
For whitespace delimiters, str.split() already does what you may want. From the Python Standard Library,
str.split([sep[, maxsplit]])
If sep is not specified or is None, a different splitting algorithm is applied: runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace. Consequently, splitting an empty string or a string consisting of just whitespace with a None separator returns [].
For example, ' 1 2 3 '.split() returns ['1', '2', '3'], and ' 1 2 3 '.split(None, 1) returns ['1', '2 3 '].
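For comparison, here is a short sketch of how re.split relates to str.split() on whitespace, and how a character class can cover separators that are not whitespace. The sample strings and the delimiter set [,;\s] are made up for illustration:

```python
import re

line = " something separated\t by \t\t\t different \t things "

# str.split() with no argument ignores leading and trailing whitespace.
print(line.split())
# ['something', 'separated', 'by', 'different', 'things']

# re.split keeps empty strings at the edges, so strip first to match.
print(re.split(r"\s+", line))
# ['', 'something', 'separated', 'by', 'different', 'things', '']
print(re.split(r"\s+", line.strip()) == line.split())  # True

# A character class handles several kinds of delimiter at once.
mixed = "a, b;c  d"
print(re.split(r"[,;\s]+", mixed))  # ['a', 'b', 'c', 'd']
```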
