Splitting strings separated by multiple possible characters? - python

...note that values will be delimited by one or more space or TAB characters
How can I use the split() method if there are multiple separating characters of different types, as in this case?

By default, split() can handle runs of mixed whitespace. Not sure if it's enough for what you need, but try it:
>>> s = "a \tb c\t\t\td"
>>> s.split()
['a', 'b', 'c', 'd']
It certainly works for multiple spaces and tabs mixed.

Split using regular expressions and not just one separator:
http://docs.python.org/2/library/re.html
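For instance, a character class lets a single re.split call handle several delimiter types at once (a minimal sketch):

```python
import re

s = "a \tb c\t\t\td"
# Split on runs of one or more spaces and/or tabs
print(re.split(r"[ \t]+", s))  # ['a', 'b', 'c', 'd']
```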

I had the same problem with some strings separated by different whitespace chars, and used \s as shown in the Regular Expressions library specification.
\s matches any whitespace character; it is equivalent to the set [ \t\n\r\f\v].
you will need to import re as the regular expression handler:
import re
line = "something separated\t by \t\t\t different \t things"
workstr = re.sub(r'\s+', '\t', line)
So, any run of one or more (+) whitespace characters (\s) is transformed into a single tab (\t), which you can then reprocess with split('\t'):
workstr == "something\tseparated\tby\tdifferent\tthings"
newline = workstr.split('\t')
newline == ['something', 'separated', 'by', 'different', 'things']

Do a text substitution first then split.
e.g. replace all tabs with spaces, then split on space.
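A minimal sketch of that substitute-then-split approach:

```python
line = "a\tb c\td"
# Replace every tab with a space, then split on single spaces
parts = line.replace("\t", " ").split(" ")
print(parts)  # ['a', 'b', 'c', 'd']
```

Note that consecutive delimiters would produce empty strings with split(" "); the regex approaches below avoid that.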

You can use regular expressions first:
import re
re.sub(r'\s+', ' ', 'text with whitespace etc').split()
['text', 'with', 'whitespace', 'etc']

For whitespace delimiters, str.split() already does what you may want. From the Python Standard Library documentation:
str.split([sep[, maxsplit]])
If sep is not specified or is None, a different splitting algorithm is applied: runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace. Consequently, splitting an empty string or a string consisting of just whitespace with a None separator returns [].
For example, ' 1 2 3 '.split() returns ['1', '2', '3'], and ' 1 2 3 '.split(None, 1) returns ['1', '2 3 '].
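The documented behaviour can be checked directly (a quick sketch):

```python
s = ' 1 2 3 '
# No separator: runs of whitespace collapse, leading/trailing whitespace is dropped
print(s.split())         # ['1', '2', '3']
# maxsplit limits the number of splits; the remainder keeps its spacing
print(s.split(None, 1))  # ['1', '2 3 ']
```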


Why does split() return more elements than split(" ") on same string?

I am using split() and split(" ") on the same string. But why does split(" ") return fewer elements than split()? I want to know in what specific input case this would happen.
str.split with the None argument (or, no argument) splits on all whitespace characters, and this isn't limited to just the space you type in using your spacebar.
In [457]: text = 'this\nshould\rhelp\tyou\funderstand'
In [458]: text.split()
Out[458]: ['this', 'should', 'help', 'you', 'understand']
In [459]: text.split(' ')
Out[459]: ['this\nshould\rhelp\tyou\x0cunderstand']
A list of all the whitespace characters that split(None) splits on can be found at All the Whitespace Characters? Is it language independent?
If you run the help command on the split() function you'll see this:
split(...) S.split([sep [,maxsplit]]) -> list of strings
Return a list of the words in the string S, using sep as the delimiter
string. If maxsplit is given, at most maxsplit splits are done. If sep
is not specified or is None, any whitespace string is a separator and
empty strings are removed from the result.
Therefore the difference between the two is that split() without a specified delimiter deletes the empty strings, while the one with a delimiter won't.
The method str.split called without arguments has a somewhat different behaviour.
First it splits by any whitespace character.
'foo bar\nbaz\tmeh'.split() # ['foo', 'bar', 'baz', 'meh']
But it also removes the empty strings from the output list.
' foo bar '.split(' ') # ['', 'foo', 'bar', '']
' foo bar '.split() # ['foo', 'bar']
In Python, the split function splits on a specific string if one is given, otherwise on runs of whitespace (and you can then access the result list by index as usual):
s = "Hello world! How are you?"
s.split()
Out[9]:['Hello', 'world!', 'How', 'are', 'you?']
s.split("!")
Out[10]: ['Hello world', ' How are you?']
s.split("!")[0]
Out[11]: 'Hello world'
From my own experience, most of the confusion has come from split()'s different treatment of whitespace.
Passing a separator like ' ' vs. None triggers different behavior in split(). According to the Python documentation:
If sep is not specified or is None, a different splitting algorithm is applied: runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace.
Below is an example in which the sample string has a trailing space ' ', the same whitespace passed to the second split(). The two calls behave differently not because of a whitespace-character mismatch, but because of how the method was designed: convenient in common scenarios, but confusing for people who expect split() to just split.
sample = "a b "
sample.split()
>>> ['a', 'b']
sample.split(' ')
>>> ['a', 'b', '']

Split and replace a string with another string of same length in python

I want to split the given string into letters, digits and special characters.
After splitting, these have to be replaced by other letters, digits and special characters respectively.
e.g. abc123wer#xyz.com is the given string. Then:
Split output: ['abc', '123', 'wer', '#', 'xyz', '.', 'com']
The replacements should come from a file that contains some letters, digits and special characters.
Replacement output: ['xyz', '231', 'etr', '$', 'pou', '#', 'fin']
One option to split the string is to use regex with re module to match letters [a-zA-Z]+, digits [0-9]+ and non-alphanumeric [^a-zA-Z0-9]+ respectively:
import re
re.findall("[a-zA-Z]+|[0-9]+|[^a-zA-Z0-9]+", s)
# ['abc', '123', 'wer', '#', 'xyz', '.', 'com']
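The question's replacement step isn't shown above; here is a minimal sketch of one way to do it, where the replacements dict stands in for whatever mapping the file provides (its keys and values here are hypothetical):

```python
import re

s = "abc123wer#xyz.com"
tokens = re.findall(r"[a-zA-Z]+|[0-9]+|[^a-zA-Z0-9]+", s)

# Hypothetical replacement table; in practice it would be loaded from the file
replacements = {"abc": "xyz", "123": "231", "wer": "etr", "#": "$",
                "xyz": "pou", ".": "#", "com": "fin"}
replaced = [replacements.get(t, t) for t in tokens]
print(replaced)  # ['xyz', '231', 'etr', '$', 'pou', '#', 'fin']
```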

Python split with multiple delimiters not working

I have a string:
feature.append(freq_and_feature(text, freq))
I want a list containing each word of the string, like [feature, append, freq, and, feature, text, freq], where each word is a string, of course.
These strings are contained in a file called helper.txt, so I'm doing the following, as suggested by multiple SO posts, like the accepted answer for this one (Python: Split string with multiple delimiters):
import re
with open("helper.txt", "r") as helper:
for row in helper:
print re.split('\' .,()_', row)
However, I get the following, which is not what I want.
[' feature.append(freq_pain_feature(text, freq))\n']
re.split('\' .,()_', row)
This looks for the string ' .,()_ to split on. You probably meant
re.split('[\' .,()_]', row)
re.split takes a regular expression as its first argument. To say "this OR that" in a regular expression, you write a|b, which matches either a or b; if you wrote ab, it would only match a followed by b. Luckily, instead of writing '| |.|,|(|..., there's a nicer form: []s state that everything inside should be treated as "match any one of these".
It seems you want to split a string with non-word or underscore characters. Use
import re
s = 'feature.append(freq_and_feature(text, freq))'
print([x for x in re.split(r'[\W_]+', s) if x])
# => ['feature', 'append', 'freq', 'and', 'feature', 'text', 'freq']
See the IDEONE demo
The [\W_]+ regex matches 1+ characters that are either non-word characters (\W = [^a-zA-Z0-9_]) or underscores.
You can get rid of the if x if you remove initial and trailing non-word characters from the input string, e.g. re.sub(r'^[\W_]+|[\W_]+$', '', s).
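That edge-trimming variant, sketched end to end:

```python
import re

s = 'feature.append(freq_and_feature(text, freq))'
# Strip leading/trailing runs of non-word characters first, then split
trimmed = re.sub(r'^[\W_]+|[\W_]+$', '', s)
print(re.split(r'[\W_]+', trimmed))
# ['feature', 'append', 'freq', 'and', 'feature', 'text', 'freq']
```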
You can try this (avoiding shadowing the built-in str; the re.IGNORECASE flag is unnecessary here since the class contains no letters):
result = re.split('[.(_,)]+', row)
result.pop()  # drop the trailing piece left over after the final '))'
print result
This will result:
['feature', 'append', 'freq', 'and', 'feature', 'text', ' freq']
I think you are trying to split on the basis of non-word characters. It should be
re.split(r'[^A-Za-z0-9]+', s)
[^A-Za-z0-9] can be translated to --> [\W_]
Python Code
s = 'feature.append(freq_and_feature(text, freq))'
print([x for x in re.split(r'[^A-Za-z0-9]+', s) if x])
This will also work, indeed
p = re.compile(r'[^\W_]+')
test_str = "feature.append(freq_and_feature(text, freq))"
print(re.findall(p, test_str))
Ideone Demo

split string in python when characters on either side of separator are not numbers

I have a large list of chemical data, that contains entries like the following:
1. 2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP
2. Lead,Paints/Pigments,Zinc
I have a function that is correctly splitting the 1st entry into:
['2,4-D', 'Benzo(a)pyrene', 'Dioxin', 'PCP', '2,4,5-TP']
based on ', ' as a separator. For the second entry, ', ' won't work. But if I could easily split any string that contains ',' with only two non-numeric characters on either side, I would be able to parse all entries like the second one without splitting up the chemicals in entries like the first, whose names contain numbers separated by commas (i.e. 2,4,5-TP).
Is there an easy pythonic way to do this?
Let me explain a little, based on #eph's answer:
import re
data_list = ['2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP', 'Lead,Paints/Pigments,Zinc']
for d in data_list:
print re.split(r'(?<=\D),\s*|\s*,(?=\D)',d)
re.split(pattern, string) will split string by the occurrences of regex pattern.
(Please read Regex Quick Start if you are not familiar with regex.)
The (?<=\D),\s*|\s*,(?=\D) consists of two parts: (?<=\D),\s* and \s*,(?=\D). The meaning of each unit:
The middle | is the OR operator.
\D matches a single character that is not a digit.
\s matches a whitespace character (including tabs and line breaks).
, matches the character ",".
* attempts to match the preceding token zero or more times. Therefore, \s* means the whitespace can appear zero or more times. (See Repetition with Star and Plus.)
(?<= ... ) and (?= ... ) are the lookbehind and lookahead assertions.
For example, q(?=u) matches a q that is followed by a u, without making the u part of the match.
Therefore, \s*,(?=\D) matches a , that is preceded by zero or more whitespace and followed by non-digit characters. Similarly, (?<=\D),\s* matches a , that is preceded by non-digit characters and followed by zero or more whitespace. The whole regex will find , that satisfy either case, which is equivalent to your requirement: ',' with only two non-numeric characters on either side.
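Putting the pieces together, a runnable sketch of this approach shows that the one pattern handles both kinds of entry:

```python
import re

pattern = r"(?<=\D),\s*|\s*,(?=\D)"
# Commas inside chemical names sit between digits, so they are left alone
print(re.split(pattern, "2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP"))
# ['2,4-D', 'Benzo(a)pyrene', 'Dioxin', 'PCP', '2,4,5-TP']
print(re.split(pattern, "Lead,Paints/Pigments,Zinc"))
# ['Lead', 'Paints/Pigments', 'Zinc']
```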
Some useful tools for regex:
Regex Cheat Sheet
Online regex tester: regex101 (with a tree structure explanation to your regex)
Use regex and lookbehind/lookahead assertion
>>> re.split(r'(?<=\D\D),\s*|,\s*(?=\D\D)', s)
['2,4-D', 'Benzo(a)pyrene', 'Dioxin', 'PCP', '2,4,5-TP']
>>> s1 = "2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP"
>>> s2 = "Lead,Paints/Pigments,Zinc"
>>> import re
>>> res1 = re.findall(r"\s*(.*?[A-Za-z])(?:,|$)", s1)
>>> res1
['2,4-D', 'Benzo(a)pyrene', 'Dioxin', 'PCP', '2,4,5-TP']
>>> res2 = re.findall(r"\s*(.*?[A-Za-z])(?:,|$)", s2)
>>> res2
['Lead', 'Paints/Pigments', 'Zinc']

How do I remove hex values in a python string with regular expressions?

I have a cell array in matlab
columns = {'MagX', 'MagY', 'MagZ', ...
'AccelerationX', 'AccelerationX', 'AccelerationX', ...
'AngularRateX', 'AngularRateX', 'AngularRateX', ...
'Temperature'}
I use these scripts which make use of matlab's hdf5write function to save the array in the hdf5 format.
I then read the hdf5 file into python using pytables. The cell array comes in as a numpy array of strings. I convert to a list and this is the output:
>>>columns
['MagX\x00\x00\x00\x08\x01\x008\xe6\x7f',
'MagY\x00\x7f\x00\x00\x00\xee\x0b9\xe6\x7f',
'MagZ\x00\x00\x00\x00\x001',
'AccelerationX',
'AccelerationY',
'AccelerationZ',
'AngularRateX',
'AngularRateY',
'AngularRateZ',
'Temperature']
These hex values pop into the strings from somewhere and I'd like to remove them. They don't always appear on the first three items of the list and I need a nice way to deal with them or to find out why they are there in the first place.
>>>print columns[0]
Mag8�
>>>columns[0]
'MagX\x00\x00\x00\x08\x01\x008\xe6\x7f'
>>>repr(columns[0])
"'MagX\\x00\\x00\\x00\\x08\\x01\\x008\\xe6\\x7f'"
>>>print repr(columns[0])
'MagX\x00\x00\x00\x08\x01\x008\xe6\x7f'
I've tried using a regular expression to remove the hex values, but have had little luck.
>>>re.sub('(\w*)\\\\x.*', '\1', columns[0])
'MagX\x00\x00\x00\x08\x01\x008\xe6\x7f'
>>>re.sub('(\w*)\\\\x.*', r'\1', columns[0])
'MagX\x00\x00\x00\x08\x01\x008\xe6\x7f'
>>>re.sub(r'(\w*)\\x.*', '\1', columns[0])
'MagX\x00\x00\x00\x08\x01\x008\xe6\x7f'
>>>re.sub('([A-Za-z]*)\x00', r'\1', columns[0])
'MagX\x08\x018\xe6\x7f'
>>>re.sub('(\w*?)', '\1', columns[0])
'\x01M\x01a\x01g\x01X\x01\x00\x01\x00\x01\x00\x01\x08\x01\x01\x01\x00\x018\x01\xe6\x01\x7f\x01'
Any suggestions on how to deal with this?
You can remove all non-word characters in the following way:
>>> re.sub(r'[^\w]', '', 'MagX\x00\x00\x00\x08\x01\x008\xe6\x7f')
'MagX8'
The regex [^\w] will match any character that is not a letter, digit, or underscore. By providing that regex in re.sub with an empty string as a replacement you will delete all other characters in the string.
Since there may be other characters you want to keep, a better solution might be to specify a larger range of characters that you want to keep that excludes control characters. For example:
>>> re.sub(r'[^\x20-\x7e]', '', 'MagX\x00\x00\x00\x08\x01\x008\xe6\x7f')
'MagX8'
Or you could replace [^\x20-\x7e] with the equivalent [^ -~], depending on which seems more clear to you.
To exclude all characters after this first control character just add a .*, like this:
>>> re.sub(r'[^ -~].*', '', 'MagX\x00\x00\x00\x08\x01\x008\xe6\x7f')
'MagX'
They're not actually in the string: you have unescaped control characters, which Python displays using hexadecimal notation; that's why you see an unusual symbol when you print the value.
You should simply be able to remove the extra levels of quoting in your regular expression. Note, however, that the re module's whitespace class \s matches only real whitespace, not control characters like \x00; in the example below the substitution leaves the string unchanged, and the NUL merely becomes invisible when printed:
>>> import re
>>> re.sub(r'\s', '?', "foo\x00bar")
'foo\x00bar'
>>> print re.sub(r'\s', '?', "foo\x00bar")
foobar
I use this one a bit to replace all input whitespace runs, including non-breaking space characters, with a single space:
>>> re.sub(r'[\xa0\s]+', ' ', input_str)
You can also do this without importing re, e.g. if you're content to keep only printable ASCII characters (note that a test like ord(c) < 129 would still let control characters such as \x00 through, since their ordinals are small):
good_string = ''.join(c for c in bad_string if 0x20 <= ord(c) <= 0x7e)
