Split a string at uppercase letters - python

What is the pythonic way to split a string before the occurrences of a given set of characters?
For example, I want to split
'TheLongAndWindingRoad'
at any occurrence of an uppercase letter (possibly except the first), and obtain
['The', 'Long', 'And', 'Winding', 'Road'].
Edit: It should also split single occurrences, i.e.
from 'ABC' I'd like to obtain
['A', 'B', 'C'].

Before Python 3.7, re.split could not split on a zero-width match. re.findall works in every version:
>>> import re
>>> re.findall('[A-Z][^A-Z]*', 'TheLongAndWindingRoad')
['The', 'Long', 'And', 'Winding', 'Road']
>>> re.findall('[A-Z][^A-Z]*', 'ABC')
['A', 'B', 'C']

Here is an alternative regex solution. The problem can be rephrased as "how do I insert a space before each uppercase letter, before doing the split":
>>> s = "TheLongAndWindingRoad ABC A123B45"
>>> re.sub( r"([A-Z])", r" \1", s).split()
['The', 'Long', 'And', 'Winding', 'Road', 'A', 'B', 'C', 'A123', 'B45']
This has the advantage of preserving all non-whitespace characters, which most other solutions do not.

Use a lookahead and a lookbehind:
In Python 3.7 and later, you can do this:
re.split('(?<=.)(?=[A-Z])', 'TheLongAndWindingRoad')
And it yields:
['The', 'Long', 'And', 'Winding', 'Road']
You need the look-behind to avoid an empty string at the beginning.
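To see why the look-behind matters, here is a small sketch comparing the two patterns. It assumes Python 3.7 or later, where re.split accepts zero-width matches; the helper name split_before_upper is made up for illustration.

```python
import re

def split_before_upper(text):
    # Hypothetical helper: split before each ASCII uppercase letter,
    # except at the very start (the look-behind needs a preceding character).
    return re.split(r'(?<=.)(?=[A-Z])', text)

# Without the look-behind, the zero-width match at position 0
# produces an empty string as the first element.
with_empty = re.split(r'(?=[A-Z])', 'TheLongAndWindingRoad')
print(with_empty)
print(split_before_upper('TheLongAndWindingRoad'))
print(split_before_upper('ABC'))
```

Note that `.` does not match a newline by default, so a capital directly after a line break would not be split by this pattern.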

>>> import re
>>> re.findall('[A-Z][a-z]*', 'TheLongAndWindingRoad')
['The', 'Long', 'And', 'Winding', 'Road']
>>> re.findall('[A-Z][a-z]*', 'SplitAString')
['Split', 'A', 'String']
>>> re.findall('[A-Z][a-z]*', 'ABC')
['A', 'B', 'C']
If you want "It'sATest" to split to ["It's", 'A', 'Test'], change the regex to "[A-Z][a-z']*"

A variation on @ChristopheD's solution:
s = 'TheLongAndWindingRoad'
# Append a sentinel uppercase letter so the last word also has an end index
pos = [i for i, e in enumerate(s + 'A') if e.isupper()]
parts = [s[pos[j]:pos[j + 1]] for j in range(len(pos) - 1)]
print(parts)

I think that a better answer might be to split the string up into words that do not end in a capital. This would handle the case where the string doesn't start with a capital letter.
re.findall('.[^A-Z]*', 'aboutTheLongAndWindingRoad')
example:
>>> import re
>>> re.findall('.[^A-Z]*', 'aboutTheLongAndWindingRoadABC')
['about', 'The', 'Long', 'And', 'Winding', 'Road', 'A', 'B', 'C']

Pythonic way could be:
"".join([(" "+i if i.isupper() else i) for i in 'TheLongAndWindingRoad']).strip().split()
['The', 'Long', 'And', 'Winding', 'Road']
Works well for Unicode, without needing re/re2.
"".join([(" "+i if i.isupper() else i) for i in 'СуперМаркетыПродажаКлиент']).strip().split()
['Супер', 'Маркеты', 'Продажа', 'Клиент']

import re
list(filter(None, re.split("([A-Z][^A-Z]*)", "TheLongAndWindingRoad")))
or
[s for s in re.split("([A-Z][^A-Z]*)", "TheLongAndWindingRoad") if s]

src = 'TheLongAndWindingRoad'
glue = ' '
result = ''.join(glue + x if x.isupper() else x for x in src).strip(glue).split(glue)

Another without regex and the ability to keep contiguous uppercase if wanted
def split_on_uppercase(s, keep_contiguous=False):
    """
    Args:
        s (str): string
        keep_contiguous (bool): flag to indicate we want to
                                keep contiguous uppercase chars together
    Returns:
        list of str: the parts of s, split before uppercase characters
    """
    string_length = len(s)
    is_lower_around = (lambda: s[i-1].islower() or
                       string_length > (i + 1) and s[i + 1].islower())
    start = 0
    parts = []
    for i in range(1, string_length):
        if s[i].isupper() and (not keep_contiguous or is_lower_around()):
            parts.append(s[start: i])
            start = i
    parts.append(s[start:])
    return parts
>>> split_on_uppercase('theLongWindingRoad')
['the', 'Long', 'Winding', 'Road']
>>> split_on_uppercase('TheLongWindingRoad')
['The', 'Long', 'Winding', 'Road']
>>> split_on_uppercase('TheLongWINDINGRoadT', True)
['The', 'Long', 'WINDING', 'Road', 'T']
>>> split_on_uppercase('ABC')
['A', 'B', 'C']
>>> split_on_uppercase('ABCD', True)
['ABCD']
>>> split_on_uppercase('')
['']
>>> split_on_uppercase('hello world')
['hello world']

Alternative solution (if you dislike explicit regexes):
s = 'TheLongAndWindingRoad'
pos = [i for i, e in enumerate(s) if e.isupper()]
parts = []
for j in range(len(pos)):
    try:
        parts.append(s[pos[j]:pos[j + 1]])
    except IndexError:
        # Last uppercase position: take everything to the end of the string
        parts.append(s[pos[j]:])
print(parts)

Replace every uppercase letter, say 'L', in the given string with a space plus that letter, " L". We can do this with a list comprehension, or we can define a function to do it, as follows.
s = 'TheLongANDWindingRoad ABC A123B45'
''.join([char if (char.islower() or not char.isalpha()) else ' '+char for char in list(s)]).strip().split()
['The', 'Long', 'A', 'N', 'D', 'Winding', 'Road', 'A', 'B', 'C', 'A123', 'B45']
If you choose to go by a function, here is how.
def splitAtUpperCase(text):
    result = ""
    for char in text:
        if char.isupper():
            result += " " + char
        else:
            result += char
    return result.split()
In the case of the given example:
print(splitAtUpperCase('TheLongAndWindingRoad'))
['The', 'Long', 'And', 'Winding', 'Road']
But when splitting at uppercase letters, we usually want to keep abbreviations, which are typically a continuous run of uppercase letters, intact. The code below does that.
def splitAtUpperCase(s):
    # Walk backward so inserted spaces do not shift the indexes still to visit
    for i in range(len(s)-1)[::-1]:
        if s[i].isupper() and s[i+1].islower():
            s = s[:i]+' '+s[i:]
        if s[i].isupper() and s[i-1].islower():
            s = s[:i]+' '+s[i:]
    return s.split()
splitAtUpperCase('TheLongANDWindingRoad')
['The', 'Long', 'AND', 'Winding', 'Road']
Thanks.

An alternative way without using regex or enumerate:
word = 'TheLongAndWindingRoad'
chars = [x for x in word]  # named chars to avoid shadowing the built-in list
for char in chars:
    if char != chars[0] and char.isupper():
        chars[chars.index(char)] = ' ' + char
fin_list = ''.join(chars).split(' ')
I think it is clearer and simpler without chaining too many methods or using a long list comprehension that can be difficult to read.

This is possible with the more_itertools.split_before tool.
import more_itertools as mit
iterable = "TheLongAndWindingRoad"
[ "".join(i) for i in mit.split_before(iterable, pred=lambda s: s.isupper())]
# ['The', 'Long', 'And', 'Winding', 'Road']
It should also split single occurrences, i.e. from 'ABC' I'd like to obtain ['A', 'B', 'C'].
iterable = "ABC"
[ "".join(i) for i in mit.split_before(iterable, pred=lambda s: s.isupper())]
# ['A', 'B', 'C']
more_itertools is a third-party package with 60+ useful tools including implementations for all of the original itertools recipes, which obviates their manual implementation.
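If installing a third-party package is not an option, the same idea can be sketched with the standard library only. The generator below is a hypothetical stand-in written for this example (the name split_before_sketch is made up); it is not the library's implementation.

```python
def split_before_sketch(iterable, pred):
    # Hypothetical stand-in: start a new chunk before every item
    # for which pred(item) is true, and yield each chunk as a list.
    chunk = []
    for item in iterable:
        if pred(item) and chunk:
            yield chunk
            chunk = []
        chunk.append(item)
    if chunk:
        yield chunk

parts = ["".join(c) for c in split_before_sketch("TheLongAndWindingRoad", str.isupper)]
print(parts)
```

Because it relies on str.isupper, this also handles non-ASCII uppercase letters.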

An alternate way using enumerate and isupper()
Code:
strs = 'TheLongAndWindingRoad'
ind = 0
new_lst = []
# Start at index 1 so the first character never triggers a split
for index, val in enumerate(strs[1:], 1):
    if val.isupper():
        new_lst.append(strs[ind:index])
        ind = index
if ind < len(strs):
    new_lst.append(strs[ind:])
print(new_lst)
Output:
['The', 'Long', 'And', 'Winding', 'Road']

Sharing what came to mind when I read the post. Different from other posts.
strs = 'TheLongAndWindingRoad'
# grab index of uppercase letters in strs
start_idx = [i for i, j in enumerate(strs) if j.isupper()]
# create empty list
strs_list = []
# initiate counter
cnt = 1
for pos in start_idx:
    start_pos = pos
    # use counter to grab the next position; an IndexError means this is the last word
    try:
        end_pos = start_idx[cnt]
    except IndexError:
        end_pos = len(strs)
    # append to the list
    strs_list.append(strs[start_pos:end_pos])
    cnt += 1

You might also want to do it this way:
def camelcase(s):
    words = []
    for char in s:
        if char.isupper():
            # mark each split point with ':' before the uppercase letter
            words.append(':' + char)
        else:
            words.append(char)
    words = ''.join(words).split(':')
    return words
This will output as follows:
s = 'oneTwoThree'
print(camelcase(s))
# ['one', 'Two', 'Three']

def solution(s):
    st = ''
    for c in s:
        if c == c.upper():
            st += ' '
        st += c
    return st

I'm using list
def split_by_upper(x):
    i = 0
    lis = list(x)
    while True:
        if i == len(lis)-1:
            if lis[i].isupper():
                lis.insert(i, ",")
            break
        if lis[i].isupper() and i != 0:
            lis.insert(i, ",")
            i += 1
        i += 1
    return "".join(lis).split(",")
OUTPUT:
data = "TheLongAndWindingRoad"
print(split_by_upper(data))
['The', 'Long', 'And', 'Winding', 'Road']

My solution for splitting on capitalized letters - keeps capitalized words
text = 'theLongAndWindingRoad ABC'
result = re.sub('(?<=.)(?=[A-Z][a-z])', r" ", text).split()
print(result)
#['the', 'Long', 'And', 'Winding', 'Road', 'ABC']

A little late to the party, but:
In [1]: camel = "CamelCaseConfig"
In [2]: parts = "".join([
f"|{c}" if c.isupper() else c
for c in camel
]).lstrip("|").split("|")
In [3]: screaming_snake = "_".join([
part.upper()
for part in parts
])
In [4]: screaming_snake
Out[4]: 'CAMEL_CASE_CONFIG'
Part of my answer is based on other people's answers here.
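The same conversion can also be sketched with a single re.sub call. This is an alternative technique, not the method used above, and the function name to_screaming_snake is made up for illustration; it only looks at ASCII uppercase letters.

```python
import re

def to_screaming_snake(camel):
    # Insert "_" at every position that is followed by an ASCII uppercase
    # letter, except at the start of the string, then upper-case the result.
    return re.sub(r'(?<!^)(?=[A-Z])', '_', camel).upper()

print(to_screaming_snake("CamelCaseConfig"))
```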

def split_string_after_upper_case(word):
    word_lst = [x for x in word]
    index = 0
    for char in word[1:]:
        index += 1
        if char.isupper():
            word_lst.insert(index, ' ')
            index += 1
    return ''.join(word_lst).split(" ")
k = split_string_after_upper_case('TheLongAndWindingRoad')
print(k)

Related

Splitting a string and retaining the delimiter with the delimiter appearing contiguously

I have the following string:
bar = 'F9B2Z1F8B30Z4'
I have a function foo that splits the string on F, then adds back the F delimiter.
def foo(my_str):
    res = ['F' + elem for elem in my_str.split('F') if elem != '']
    return res
This works unless there are two "F"s back-to-back in the string. For example,
foo('FF9B2Z1F8B30Z4')
returns
['F9B2Z1', 'F8B30Z4']
(the double "F" at the start of the string is not processed)
I'd like the function to split on the first "F" and add it to the list, as follows:
['F', 'F9B2Z1', 'F8B30Z4']
If there is a double "F" in the middle of the string, then the desired behavior would be:
foo('F9B2Z1FF8B30Z4')
['F9B2Z1', 'F', 'F8B30Z4']
Any help would be greatly appreciated.
Instead of filtering with if, use slicing, because an empty string is a problem only at the beginning:
def foo(my_str):
    res = ['F' + elem for elem in my_str.split('F')]
    return res[1:] if my_str and my_str[0]=='F' else res
Output:
>>> foo('FF9B2Z1F8B30Z4')
['F', 'F9B2Z1', 'F8B30Z4']
>>> foo('FF9B2Z1FF8B30Z4FF')
['F', 'F9B2Z1', 'F', 'F8B30Z4', 'F', 'F']
>>> foo('9B2Z1F8B30Z4')
['F9B2Z1', 'F8B30Z4']
>>> foo('')
['F']
Using regex it can be done with
import re
pattern = r'^[^F]+|(?<=F)[^F]*'
The ^[^F]+ captures all characters at the beginning of strings that do not start with F.
(?<=F)[^F]* captures anything following an F so long as it is not an F character including empty matches.
>>> print(['F' + x for x in re.findall(pattern, 'abcFFFAFF')])
['Fabc', 'F', 'F', 'FA', 'F', 'F']
>>> print(['F' + x for x in re.findall(pattern, 'FFabcFA')])
['F', 'Fabc', 'FA']
>>> print(['F' + x for x in re.findall(pattern, 'abc')])
['Fabc']
Note that this returns nothing for empty strings. If empty strings need to return ['F'] then pattern can be changed to pattern = r'^[^F]+|(?<=F)[^F]*|^$' adding ^$ to capture empty strings.
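Putting the regex variant together as a drop-in function might look like the sketch below. The name foo_regex is made up for this example, and it uses the extended pattern with ^$ so that an empty string yields ['F'].

```python
import re

# Pattern from the answer above, including the ^$ branch for empty input.
PATTERN = re.compile(r'^[^F]+|(?<=F)[^F]*|^$')

def foo_regex(my_str):
    # Each match is the text that follows an F (possibly empty),
    # or the leading text of a string that does not start with F.
    return ['F' + part for part in PATTERN.findall(my_str)]

print(foo_regex('FF9B2Z1F8B30Z4'))
print(foo_regex('F9B2Z1FF8B30Z4'))
```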

how to combine two text in one array

vowels=["a","e","i","o","u"]
word=["a","ir","the","book"]
for i in word:
    if len(i) == 1 and i in vowels:
        ....
I wrote this loop to detect an 'a' that is in the 'vowels' array and is only one character long (for example, 'a' has length 1 and 'ant' has length 3).
I want to combine 'a' with the following word in the array.
How can I combine 'a' with 'ir' so that the 'word' array becomes
word=["air","the","book"]
You can use an iterator to construct the new list, just add the next() item if the current item is in vowels, e.g.:
>>> vowels = ['a', 'e', 'i', 'o', 'u']
>>> words = ["a", "ir", "the", "book"]
>>> iterable = iter(words)
>>> [i+next(iterable, '') if i in vowels else i for i in iterable]
['air', 'the', 'book']
Or if you happen to need to add multiple vowels:
def concat(i, iterable):
    return i + concat(next(iterable, ''), iterable) if i in vowels else i
>>> words = ['a', 'i', 'r']
>>> iterable = iter(words)
>>> [concat(i, iterable) for i in iterable]
['air']
>>> vowels = {'a', 'e', 'i', 'o', 'u'} # used set for efficient `in` operation
>>> words = ['a', 'ir', 'the', 'book']
>>> for i in range(len(words) - 2, -1, -1):
...     if words[i] in vowels:
...         # len(words[i]) == 1 is redundant (all vowels items' length = 1)
...         words[i] += words.pop(i + 1)  # merge vowel with next word
...
>>> words
['air', 'the', 'book']
NOTE: The loop iterates backward; otherwise the indexes would break as soon as list.pop is called.
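A forward pass that builds a new list avoids the index problem altogether. The following is a minimal sketch; the function name merge_vowels and its default vowel set are assumptions made for this example.

```python
def merge_vowels(words, vowels=frozenset("aeiou")):
    # Collect consecutive single-vowel items in `prefix` and glue them
    # onto the next non-vowel word. The input list is left unchanged.
    merged = []
    prefix = ""
    for w in words:
        if w in vowels:
            prefix += w
        else:
            merged.append(prefix + w)
            prefix = ""
    if prefix:
        # Trailing vowels with no following word are kept as one item
        merged.append(prefix)
    return merged

print(merge_vowels(["a", "ir", "the", "book"]))
```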

String Splitting on upper case but retain upper case substrings

Say I have:
myString = 'myPERLPythonJavaScriptJavaTextSample'
I would like to split this as:
['my', 'PERL', 'Python', 'Java', 'Script', 'Java', 'Text', 'Sample']
What is/are the PYTHONIC way(s) of doing this?
I should have been clearer :-(. Here is another example of what I am after:
myString2 = ['myAbcDEFGhijklMNOP']
should return:
['my', 'Abc', 'DEF', 'Ghijkl', 'MNOP']
'...DEFGh...' becomes '....', 'DEF', 'Gh...' because 'G' is the last character of the string of upper cases 'DEFG'. That is, we split at the penultimate upper case letter, if there are more than one successive upper case letters. This does not apply for the last substring: return 'MNOP' as is.
Use a regular expression to separate words with spaces then split:
import re
myString = 'myPERLPythonJavaScriptJavaTextSample'
myString = re.sub(r'((?<=[a-z])[A-Z]|(?<!\A)[A-Z](?=[a-z]))', r' \1', myString)
result = myString.split()
print(result)
returns: ['my', 'PERL', 'Python', 'Java', 'Script', 'Java', 'Text', 'Sample']
As stated in comments you can't get this exactly but you can come close and post-process it:
myString = 'myPERLPythonJavaScriptJavaTextSample'
ll = []
val = ''
for ch in myString:
    if ch.isupper():
        ll.append(val)
        val = ''
    val += ch
ll.append(val)  # keep the final word as well
print(ll)
['my', 'P', 'E', 'R', 'L', 'Python', 'Java', 'Script', 'Java', 'Text', 'Sample']
Try using regular expressions:
import re
myString = 'myPERLPythonJavaScriptJavaTextSample'
regex = '([a-z]+)(?=[A-Z])|([A-Z][a-z]+)'
# re.split also returns None or empty entries for unmatched groups; drop them
ll = [part for part in re.split(regex, myString) if part]
print(ll)
Which returns:
['my', 'PERL', 'Python', 'Java', 'Script', 'Java', 'Text', 'Sample']
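For the second example in the question ('myAbcDEFGhijklMNOP'), a single re.findall pattern may be enough. The pattern and the name split_keep_caps below are assumptions made for this sketch, not taken from the answers above; it only recognises ASCII letters and silently drops any other characters.

```python
import re

# Alternatives are tried left to right:
#   1. a run of capitals that is followed by a capitalized word
#   2. an optional capital followed by lowercase letters
#   3. any remaining run of capitals (for example at the end of the string)
WORDS = re.compile(r'[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+')

def split_keep_caps(text):
    return WORDS.findall(text)

print(split_keep_caps('myPERLPythonJavaScriptJavaTextSample'))
print(split_keep_caps('myAbcDEFGhijklMNOP'))
```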

Split a string in Python having parenthesis (multiple splitters)

I have a string, for example:
"ab(abcds)kadf(sd)k(afsd)(lbne)"
I want to split it to a list such that the list is stored like this:
a
b
abcds
k
a
d
f
sd
k
afsd
lbne
I need to get the elements outside the parenthesis in separate rows and the ones inside it in separate ones.
I am not able to think of any solution to this problem.
You can use iter to make an iterator and use itertools.takewhile to extract the strings between the parens:
from itertools import takewhile
s = "ab(abcds)kadf(sd)k(afsd)(lbne)"
it = iter(s)
print([ch if ch != "(" else "".join(takewhile(lambda x: x != ")", it)) for ch in it])
['a', 'b', 'abcds', 'k', 'a', 'd', 'f', 'sd', 'k', 'afsd', 'lbne']
If ch is not equal to (, we just take the char; if ch is a (, we use takewhile, which keeps taking chars until we hit a ).
Or use re.findall to get all strings enclosed in parentheses with \((.+?)\) and all other characters with (\w):
print([''.join(tup) for tup in re.findall(r'\((.+?)\)|(\w)', s)])
['a', 'b', 'abcds', 'k', 'a', 'd', 'f', 'sd', 'k', 'afsd', 'lbne']
You just need to use the magic of 're.split' and some logic.
import re
string = "ab(abcds)kadf(sd)k(afsd)(lbne)"
temp = []
x = re.split(r'[(]', string)
# x = ['ab', 'abcds)kadf', 'sd)k', 'afsd)', 'lbne)']
for i in x:
    if ')' not in i:
        temp.extend(list(i))
    else:
        t = re.split(r'[)]', i)
        temp.append(t[0])
        temp.extend(list(t[1]))
print(temp)
# temp = ['a', 'b', 'abcds', 'k', 'a', 'd', 'f', 'sd', 'k', 'afsd', 'lbne']
Note the difference between append and extend here: append adds a single item, while extend adds each element of an iterable.
I hope this helps.
You have two options. The really easy one is to just iterate over the string. For example:
my_string = "ab(abcds)kadf(sd)k(afsd)(lbne)"
my_list = []
in_parens = False
buffer = ''
for char in my_string:
    if char == '(':
        in_parens = True
    elif char == ')':
        in_parens = False
        my_list.append(buffer)
        buffer = ''
    elif in_parens:
        buffer += char
    else:
        my_list.append(char)
The other option is regex.
I would suggest regex. It is worth practicing.
Try: Python re. If you are new to re it may take a bit of time but you can do all kind of string manipulations once you get it.
import re
search_string = 'ab(abcds)kadf(sd)k(afsd)(lbne)'
re_pattern = re.compile(r'(\w)|\((\w*)\)') # Match single character or characters in parenthesis
print([x if x else y for x, y in re_pattern.findall(search_string)])

How to search for each elements of a list in a string in python

let's say
there's a list
vowels = ['a', 'e', 'i', 'o', 'u']
x = raw_input("Enter something?")
Now how to find instances of these vowels in the x? I want to modify x so that it contains only non vowel letters.
.find won't work.
vowels = {'a', 'e', 'i', 'o', 'u'}
x =input('Enter...')
new_string = ''.join(c for c in x if c not in vowels)
This creates a new copy of x minus the vowels, saved as new_string. I have changed vowels to a set so that lookups are faster (somewhat trivial in this example, but it's a good habit to use where appropriate). Strings are immutable, so you can't just take the letters out of x; you have to create a new string that is a copy of x without the values you don't need. .join() puts the whole thing back together.
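Another standard-library route is str.translate with a deletion table. The sketch below is an alternative to the join approach; the function name remove_vowels is made up, and it also removes uppercase vowels, which is an assumption beyond the original question.

```python
# Build the table once: the third argument of str.maketrans lists
# the characters to delete.
VOWEL_TABLE = str.maketrans('', '', 'aeiouAEIOU')

def remove_vowels(text):
    # Returns a new string; the original is untouched (strings are immutable).
    return text.translate(VOWEL_TABLE)

print(remove_vowels("Enter something"))
```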
You can use the count function for each letter. For example, x.count('a') counts how many 'a' characters are in the word. Then iterate over all the vowels and use sum to find the total number of vowels.
vowelCount = sum(x.count(vowel) for vowel in vowels)
from collections import Counter
vowels = {'a', 'e', 'i', 'o', 'u'}
s = "foobar"
print(sum(v for k,v in Counter(s).items() if k in vowels))
3
Or use dict.get with a default value of 0:
s = "foobar"
c = Counter(s)
print(sum(c.get(k,0) for k in vowels))
3
You can use like this,
>>> st = 'test test'
>>> len(re.findall('[aeiou]', st, re.IGNORECASE))
2
Or,
>>> vowels = ['a', 'e', 'i', 'o', 'u']
>>> sum(map(lambda x: vowels.count(x) if x in vowels else 0, st))
2
Or,
>>> len([ ch for ch in st if ch in vowels])
2
