How to split string everywhere a letter appears? - python

I have a string containing letters and numbers like this -
12345A6789B12345C
How can I get a list that looks like this
[12345A, 6789B, 12345C]

>>> my_string = '12345A6789B12345C'
>>> import re
>>> re.findall('\d*\w', my_string)
['12345A', '6789B', '12345C']

For the sake of completeness, non-regex solution:
data = "12345A6789B12345C"
result = [""]
for char in data:
result[-1] += char
if char.isalpha():
result.append("")
if not result[-1]:
result.pop()
print(result)
# ['12345A', '6789B', '12345C']
Should be faster for smaller strings, but if you're working with huge data go with regex as once compiled and warmed up, the search separation happens on the 'fast' C side.

You could build this with a generator, too. The approach below keeps track of start and end indices of each slice, yielding a generator of strings. You'll have to cast it to list to use it as one, though (splitonalpha(some_string)[-1] will fail, since generators aren't indexable)
def splitonalpha(s):
start = 0
for end, ch in enumerate(s, start=1):
if ch.isalpha:
yield s[start:end]
start = end
list(splitonalpha("12345A6789B12345C"))
# ['12345A', '6789B', '12345C']

Related

Compare String Indices in Python

Coming from other languages, I know how to compare string indices to test for equality. But in Python I get the following error when trying to compare indices within a string.
TypeError: string indices must be integers
How can the indices of a string be compared for equality?
program.py
myString = 'aabccddde'
for i in myString:
if myString[i] == myString[i + 1]:
print('match')
Python for loops are sometimes known as "foreach" loops in other languages. You are not looping over indices, you are looping over the characters of the string. Strings are iterable and produce their characters when iterated over.
You may find this answer and part of this answer helpful to understand how exactly a for loop in Python works.
Regarding your actual problem, the itertools documentation has a recipe for this called pairwise. You can either copy-paste the function or import it from more_itertools (which needs to be installed).
Demo:
>>> # copy recipe from itertools docs or import from more_itertools
>>> from more_itertools import pairwise
>>> myString = 'aabccddde'
>>>
>>> for char, next_char in pairwise(myString):
... if char == next_char:
... print(char, next_char, 'match')
...
a a match
c c match
d d match
d d match
In my opinion, we should avoid explicitly using indices when iterating whenever possible. Python has high level abstractions that allow you to not get sidetracked by the indices most of the time. In addition, lots of things are iterable and can't even be indexed into using integers. The above code works for any iterable being passed to pairwise, not just strings.
In this phrase:
for i in myString:
You are iterating over single letters. So myString[i] means 'aabccddde'["a"] in the first iteration, what is of course not possible.
If you would like to save order of letters and letters, you may use "enumerate" or "len":
myString = 'aabccddde'
for i in range(len(myString)-2):
if myString[i] == myString[i + 1]:
print('match')
You can use enumerate:
myString = 'aabccddde'
l = len(myString)
for i,j in enumerate(myString):
if i == l-1: # This if block will prevent the error message for last index
break
if myString[i] == myString[i + 1]:
print('match')
Enumerate will help iterate over each character and the index of the character in the string:
myString = 'aabccddde'
for idx, char in enumerate(myString, ):
# guard against IndexError
if idx+1 == len(myString):
break
if char == myString[idx + 1]:
print('match')

Remove punctuation items from end of string

I have a seemingly simple problem, which I cannot seem to solve. Given a string containing a DOI, I need to remove the last character if it is a punctuation mark until the last character is letter or number.
For example, if the string was:
sampleDoi = "10.1097/JHM-D-18-00044.',"
I want the following output:
"10.1097/JHM-D-18-00044"
ie. remove .',
I wrote the following script to do this:
invalidChars = set(string.punctuation.replace("_", ""))
a = "10.1097/JHM-D-18-00044.',"
i = -1
for each in reversed(a):
if any(char in invalidChars for char in each):
a = a[:i]
i = i - 1
else:
print (a)
break
However, this produces 10.1097/JHM-D-18-00 but I would like it to produce 10.1097/JHM-D-18-00044. Why is the 44 removed from the end?
The string function rstrip() is designed to do exactly this:
>>> sampleDoi = "10.1097/JHM-D-18-00044.',"
>>> sampleDoi.rstrip(",.'")
'10.1097/JHM-D-18-00044'
Corrected code:
import string
invalidChars = set(string.punctuation.replace("_", ""))
a = "10.1097/JHM-D-18-00044.',"
i = -1
for each in reversed(a):
if any(char in invalidChars for char in each):
a = a[:i]
i = i # Well Really this line can just be removed all together.
else:
print (a)
break
This gives the output you want, while keeping the original code mostly the same.
This is one way using next and str.isalnum with a generator expression utilizing enumerate / reversed.
sampleDoi = "10.1097/JHM-D-18-00044.',"
idx = next((i for i, j in enumerate(reversed(sampleDoi)) if j.isalnum()), 0)
res = sampleDoi[:-idx]
print(res)
'10.1097/JHM-D-18-00044'
The default parameter 0is used so that, if no alphanumeric character is found, an empty string is returned.
If you dont wanna use regex:
the_str = "10.1097/JHM-D-18-00044.',"
while the_str[-1] in string.punctuation:
the_str = the_str[:-1]
Removes the last character until it's no longer a punctuation character.

Remove duplicates from a string so that it appears once

Given a string which contains only lowercase letters, remove duplicate letters so that every letter appear once and only once. I must make sure your result is the smallest in lexicographical order among all possible results.
def removeDuplicates(str):
dict = {}
word = []
for i in xrange(len(str)):
if str[i] not in word:
word.append(str[i])
dict[str[i]] = i
else:
word.remove(str[i])
word.append(str[i])
dict[str[i]] = i
ind = dict.values()
# Second scan
for i in xrange(len(str)):
if str.index(str[i]) in ind:
continue
temp = dict[str[i]]
dict[str[i]] = i
lst = sorted(dict.keys(),key = lambda d:dict[d])
if ''.join(lst) < ''.join(word):
word = lst
else:
dict[str[i]] = temp
return ''.join(word)
I am not getting the desired result
print removeDuplicateLetters("cbacdcbc")
Input:
"cbacdcbc"
Output:
"abcd"
Expected:
"acdb"
Use a set. A set is a data structure similar to a list, but it removes all duplicates. You can instantiate a set by doing set(), or setting a variable to a set by using curly brackets. However, this isn't very good for instantiating empty sets, because then Python will think that it's a dictionary. So to achieve what you're doing, you could make the following function:
def removeDuplicates(string):
return ''.join(sorted(set(string)))
Dorian's answer IS the way to go for any practical application, so my addition is mostly toying around.
If a word is really long, it's more efficient to just search whether each letter in the alphabet is in the string and keep only those that are present. Explicitly,
from string import ascii_lowercase
def removeDuplicates(string):
return ''.join(letter for letter in ascii_lowercase if letter in string)
Code to test timings
import random
import timeit
def compare(string, n):
s1 = "''.join(sorted(set('{}')))".format(string)
print timeit.timeit(s1, number=n)
s2 = "from string import ascii_lowercase; ''.join(letter for letter in ascii_lowercase if letter in '{}')".format(string)
print timeit.timeit(s2, number=n)
Tests:
>>> word = 'cbacdcbc'
>>> compare(word, 1000)
0.00385931823843
0.013727678263
>>> word = ''.join(random.choice(ascii_lowercase) for _ in xrange(100000))
>>> compare(word, 1000)
2.21139290323
0.0071371927042
>>> word = 'a'*100000 + ascii_lowercase
>>> compare(word, 1000)
2.20644530225
1.63490857359
This shows that Dorian's answer should perform equally well or even better for small words, even though the speed isn't noticeable by humans. However, for very large strings, this method is much faster. Even for an edge case, where every letter is the same and the rest of the letters can only be found by transversing the whole string it performs better.
Still, Dorian's answer is more elegant and practical.
This is what makes the test succeed.
def removeDuplicates(my_string):
for char in sorted(set(my_string)):
suffix = my_string[my_string.index(char):]
if set(suffix) == set(my_string):
return char + removeDuplicates(suffix.replace(char, ''))
return ''
print removeDuplicates('cbacdcbc')
acdb

HOW TO "Arbitrary" format items in list/dict/etc. EX: change 4th character in every string in list

first of all i want to mention that there might not be any real life applications for this simple script i created, but i did it because I'm learning and I couldn't find anything similar here in SO. I wanted to know what could be done to "arbitrarily" change characters in an iterable like a list.
Sure tile() is a handy tool I learned relatively quick, but then I got to think what if, just for kicks, i wanted to format (upper case) the last character instead? or the third, the middle one,etc. What about lower case? Replacing specific characters with others?
Like I said this is surely not perfect but could give away some food for thought to other noobs like myself. Plus I think this can be modified in hundreds of ways to achieve all kinds of different formatting.
How about helping me improve what I just did? how about making it more lean and mean? checking for style, methods, efficiency, etc...
Here it goes:
words = ['house', 'flower', 'tree'] #string list
counter = 0 #counter to iterate over the items in list
chars = 4 #character position in string (0,1,2...)
for counter in range (0,len(words)):
while counter < len(words):
z = list(words[counter]) # z is a temp list created to slice words
if len(z) > chars: # to compare char position and z length
upper = [k.upper() for k in z[chars]] # string formatting EX: uppercase
z[chars] = upper [0] # replace formatted character with original
words[counter] = ("".join(z)) # convert and replace temp list back into original word str list
counter +=1
else:
break
print (words)
['housE', 'flowEr', 'tree']
This is somewhat of a combination of both (so +1 to both of them :) ). The main function accepts a list, an arbitrary function and the character to act on:
In [47]: def RandomAlter(l, func, char):
return [''.join([func(w[x]) if x == char else w[x] for x in xrange(len(w))]) for w in l]
....:
In [48]: RandomAlter(words, str.upper, 4)
Out[48]: ['housE', 'flowEr', 'tree']
In [49]: RandomAlter([str.upper(w) for w in words], str.lower, 2)
Out[49]: ['HOuSE', 'FLoWER', 'TReE']
In [50]: RandomAlter(words, lambda x: '_', 4)
Out[50]: ['hous_', 'flow_r', 'tree']
The function RandomAlter can be rewritten as this, which may make it a bit more clear (it takes advantage of a feature called list comprehensions to reduce the lines of code needed).
def RandomAlter(l, func, char):
# For each word in our list
main_list = []
for w in l:
# Create a container that is going to hold our new 'word'
new_word = []
# Iterate over a range that is equal to the number of chars in the word
# xrange is a more memory efficient 'range' - same behavior
for x in xrange(len(w)):
# If the current position is the character we want to modify
if x == char:
# Apply the function to the character and append to our 'word'
# This is a cool Python feature - you can pass around functions
# just like any other variable
new_word.append(func(w[x]))
else:
# Just append the normal letter
new_word.append(w[x])
# Now we append the 'word' to our main_list. However since the 'word' is
# a list of letters, we need to 'join' them together to form a string
main_list.append(''.join(new_word))
# Now just return the main_list, which will be a list of altered words
return main_list
There's much better Pythonistas than me, but here's one attempt:
[''.join(
[a[x].upper() if x == chars else a[x]
for x in xrange(0,len(a))]
)
for a in words]
Also, we're talking about the programmer's 4th, right? What everyone else calls 5th, yes?
Some comments on your code:
for counter in range (0,len(words)):
while counter < len(words):
This won't compile unless you indent the while loop under the for loop. And, if you do that, the inner loop will completely screw up the loop counter for the outer loop. And finally, you almost never want to maintain an explicit loop counter in Python. You probably want this:
for counter, word in enumerate(words):
Next:
z = list(words[counter]) # z is a temp list created to slice words
You can already slice strings, in exactly the same way you slice lists, so this is unnecessary.
Next:
upper = [k.upper() for k in z[chars]] # string formatting EX: uppercase
This is a bad name for the variable, since there's a function with the exact same name—which you're calling on the same line.
Meanwhile, the way you defined things, z[chars] is a character, a copy of words[4]. You can iterate over a single character in Python, because each character is itself a string. but it's generally pointless—[k.upper() for k in z[chars]] is the same thing as [z[chars].upper()].
z[chars] = upper [0] # replace formatted character with original
So you only wanted the list of 1 character to get the first character out of it… why make it a list in the first place? Just replace the last two lines with z[chars] = z[chars].upper().
else:
break
This is going to stop on the first string shorter than length 4, rather than just skip strings shorter than length 4, which is what it seems like you want. The way to say that is continue, not break. Or, better, just fall off the end of the list. In some cases, it's hard to write things without a continue, but in this case, it's easy—it's already at the end of the loop, and in fact it's inside an else: that has nothing else in it, so just remove both lines.
It's hard to tell with upper that your loops are wrong, because if you accidentally call upper twice, it looks the same as if you called it once. Change the upper to chr(ord(k)+1), which replaces any letter with the next letter. Then try it with:
words = ['house', 'flower', 'tree', 'a', 'abcdefgh']
You'll notice that, e.g., you get 'flowgr' instead of 'flowfr'.
You may also want to add a variable that counts up the number of times you run through the inner loop. It should only be len(words) times, but it's actually len(words) * len(words) if you have no short words, or len(words) * len(<up to the first short word>) if you have any. You're making the computer do a whole lot of extra work—if you have 1000 words, it has to do 1000000 loops instead of 1000. In technical terms, your algorithm is O(N^2), even though it only needs to be O(N).
Putting it all together:
words = ['house', 'flower', 'tree', 'a', 'abcdefgh'] #string list
chars = 4 #character position in string (0,1,2...)
for counter, word in enumerate(words):
if len(word) > chars: # to compare char position and z length
z = list(word)
z[chars] = chr(ord(z[chars]+1) # replace character with next character
words[counter] = "".join(z) # convert and replace temp list back into original word str list
print (words)
That does the same thing as your original code (except using "next character" instead of "uppercase character"), without the bugs, with much less work for the computer, and much easier to read.
I think the general case of what you're talking about is a method that, given a string and an index, returns that string, with the indexed character transformed according to some rule.
def transform_string(strng, index, transform):
lst = list(strng)
if index < len(lst):
lst[index] = transform(lst[index])
return ''.join(lst)
words = ['house', 'flower', 'tree']
output = [transform_string(word, 4, str.upper) for word in words]
To make it even more abstract, you could have a factory that returns a method, like so:
def transformation_factory(index, transform):
def inner(word):
lst = list(word)
if index < len(lst):
lst[index] = transform(lst[index])
return inner
transform = transformation_factory(4, lambda x: x.upper())
output = map(transform, words)

String manipulation weirdness when incrementing trailing digit

I got this code:
myString = 'blabla123_01_version6688_01_01Long_stringWithNumbers'
versionSplit = re.findall(r'-?\d+|[a-zA-Z!##$%^&*()_+.,<>{}]+|\W+?', myString)
for i in reversed(versionSplit):
id = versionSplit.index(i)
if i.isdigit():
digit = '%0'+str(len(i))+'d'
i = int(i) + 1
i = digit % i
versionSplit[id]=str(i)
break
final = ''
myString = final.join(versionSplit)
print myString
Which suppose to increase ONLY the last digit from the string given. But if you run that code you will see that if there is the same digit in the string as the last one it will increase it one after the other if you keep running the script. Can anyone help me find out why?
Thank you in advance for any help
Is there a reason why you aren't doing something like this instead:
prefix, version = re.match(r"(.*[^\d]+)([\d]+)$", myString).groups()
newstring = prefix + str(int(version)+1).rjust(len(version), '0')
Notes:
This will actually "carry over" the version numbers properly: ("09" -> "10") and ("99" -> "100")
This regex assumes at least one non-numeric character before the final version substring at the end. If this is not matched, it will throw an AttributeError. You could restructure it to throw a more suitable or specific exception (e.g. if re.match(...) returns None; see comments below for more info).
Adjust accordingly.
The issue is the use of the list.index() function on line 5. This returns the index of the first occurrence of a value in a list, from left to right, but the code is iterating over the reversed list (right to left). There are lots of ways to straighten this out, but here's one that makes the fewest changes to your existing code: Iterate over indices in reverse (avoids reversing the list).
for idx in range(len(versionSplit)-1, -1, -1):
i = versionSplit[idx]
if chunk.isdigit():
digit = '%0'+str(len(i))+'d'
i = int(i) + 1
i = digit % i
versionSplit[idx]=str(i)
break
myString = 'blabla123_01_version6688_01_01veryLong_stringWithNumbers01'
versionSplit = re.findall(r'-?\d+|[^\-\d]+', myString)
for i in xrange(len(versionSplit) - 1, -1, -1):
s = versionSplit[i]
if s.isdigit():
n = int(s) + 1
versionSplit[i] = "%0*d" % (len(s), n)
break
myString = ''.join(versionSplit)
print myString
Notes:
It is silly to use the .index() method to try to find the string. Just use a decrementing index to try each part of versionSplit. This was where your problem was, as commented above by #David Robinson.
Don't use id as a variable name; you are covering up the built-in function id().
This code is using the * in a format template, which will accept an integer and set the width.
I simplified the pattern: either you are matching a digit (with optional leading minus sign) or else you are matching non-digits.
I tested this and it seems to work.
First, three notes:
id is a reserved python word;
For joining, a more pythonic idiom is ''.join(), using a literal empty string
reversed() returns an iterator, not a list. That's why I use list(reversed()), in order to do rev.index(i) later.
Corrected code:
import re
myString = 'blabla123_01_version6688_01_01veryLong_stringWithNumbers01'
print myString
versionSplit = re.findall(r'-?\d+|[a-zA-Z!##$%^&*()_+.,<>{}]+|\W+?', myString)
rev = list(reversed(versionSplit)) # create a reversed list to work with from now on
for i in rev:
idd = rev.index(i)
if i.isdigit():
digit = '%0'+str(len(i))+'d'
i = int(i) + 1
i = digit % i
rev[idd]=str(i)
break
myString = ''.join(reversed(rev)) # reverse again only just before joining
print myString

Categories