Splitting a number pattern - python

I want to know how to split a string like
44664212666666 into [44664212, 666666] or
58834888888888 into [58834, 888888888]
without knowing in advance where the trailing run of repeated digits begins.
So passing it to a function, say seperate(str), would give [non_recurring_part, end_recurring_digits].

print re.findall(r'^(.+?)((.)\3+)$', '446642126666')[0][:-1] # ('44664212', '6666')
As pointed out in the comments, the last group should be made optional to handle strings with no repeated symbols correctly:
print re.findall(r'^(.+?)((.)\3+)?$', '12333')[0][:-1] # ('12', '333')
print re.findall(r'^(.+?)((.)\3+)?$', '123')[0][:-1] # ('123', '')

Same answer as Justin:
>>> s = '44664212666666'
>>> for i in range(len(s) - 1, -1, -1):
...     if s[i] != s[-1]:
...         break
...
>>> non_recurring_part, end_recurring_digits = s[:i + 1], s[i + 1:]
>>> non_recurring_part, end_recurring_digits
('44664212', '666666')

Here is a non-regex answer that deals with cases when there are no repeating digits.
def separate(s):
    last = s[-1]
    t = s.rstrip(last)
    if len(t) + 1 == len(s):
        return (s, '')
    else:
        return t, last * (len(s) - len(t))
Examples:
>>> separate('123444')
('123', '444')
>>> separate('1234')
('1234', '')
>>> separate('11111')
('', '11111')

Can't you just scan from the last character towards the first and stop when the next character doesn't equal the previous one? Then split at that index.
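The backward scan described here can be sketched in Python as follows. This is a minimal sketch, not a definitive implementation: the name split_trailing_run is made up for illustration, and it assumes that a single trailing character does not count as "recurring", so the tail is empty in that case.

```python
def split_trailing_run(s):
    """Split s into (head, trailing run of one repeated character).

    Hypothetical helper: a run of length 1 is not treated as recurring,
    so the tail is '' in that case.
    """
    if not s:
        return s, ''
    i = len(s) - 1
    # Walk left while the previous character still matches the last one.
    while i > 0 and s[i - 1] == s[-1]:
        i -= 1
    if i == len(s) - 1:  # only one trailing character: no repetition
        return s, ''
    return s[:i], s[i:]


print(split_trailing_run('44664212666666'))  # ('44664212', '666666')
print(split_trailing_run('58834888888888'))  # ('58834', '888888888')
print(split_trailing_run('1234'))            # ('1234', '')
```

The scan touches only the trailing run, so it stops as soon as the first differing character is found.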

import re

def separate(n):
    s = str(n)
    # groups() also returns the inner (.) group, so keep only the first two
    return re.match(r'^(.*?)((.)\3*)$', s).groups()[:2]

def seperate(s):
    return re.findall('^(.+?)(' + s[-1] + '+)$', s)

>>> import re
>>> m = re.match(r'(.*?)((.)\3+)$', '1233333')
>>> print list(m.groups())[:2]
['12', '33333']
This uses a regular expression. The last part of the pattern, ((.)\3+)$, says that the same character must be repeated until the end of the string; everything before it goes into the first group. m.groups() returns a tuple of the substrings matched by the parenthesized groups: element 0 is the first part and element 1 is the repeated tail. The third group (the single repeated character) is not needed, so we ignore it.
Another important point is the ? in .*?. It makes the match non-greedy, which means the first group stops as early as possible and hands over to the second part of the pattern.
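To make the effect of the non-greedy ? concrete, here is a small comparison of the same pattern with and without it. The variable names lazy and greedy are only for illustration.

```python
import re

s = '1233333'

# Non-greedy head: the first group stops as early as the rest can still match.
lazy = re.match(r'(.*?)((.)\3+)$', s).groups()[:2]

# Greedy head: the first group grabs as much as possible, leaving only the
# shortest tail that still satisfies ((.)\3+)$, i.e. two repeated characters.
greedy = re.match(r'(.*)((.)\3+)$', s).groups()[:2]

print(lazy)    # ('12', '33333')
print(greedy)  # ('12333', '33')
```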

Start iterating from the end towards the first character and find the position where the character changes; that position is the split point. Call that index i; the result is then {substring [0, i), substring [i, size)}.
int pos=0;
String str="ABCDEF";
for (int i = str.length()-1; i > 0; i--)
{
    if(str.charAt(i) != str.charAt(i-1))
    {
        pos=i;
        break;
    }
}
String sub1=str.substring(0, pos);
String sub2=str.substring(pos);

Related

StarKill riddle in Python

Riddle:
Return a version of the given string, where for every star (*) in the string the star and the chars immediately to its left and right are gone. So "ab*cd" yields "ad" and "ab**cd" also yields "ad".
I'm wondering if there's a pythonish way to improve this algorithm:
def starKill(string):
    result = ''
    for idx in range(len(string)):
        if(idx == 0 and string[idx] != '*'):
            result += string[idx]
        elif (idx > 0 and string[idx] != '*' and (string[idx-1]) != '*'):
            result += string[idx]
        elif (idx > 0 and string[idx] == '*' and (string[idx-1]) != '*'):
            result = result[0:len(result) - 1]
    return result
starKill("wacy*xko") yields wacko
Here's a numpy solution just for fun:
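import numpy as np

def star_kill(string, target='*'):
    arr = np.array(list(string), dtype='U1')
    mask = arr != target          # True for every non-star character
    mask[1:] &= mask[:-1]         # drop characters right after a star
    mask[:-1] &= mask[1:]         # drop characters right before a star
    return ''.join(arr[mask])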
Regular expression?
>>> import re
>>> for s in "ab*cd", "ab**cd", "wacy*xko", "*Mad*Physicist*":
...     print(re.sub(r'\w?\*\w?', '', s))
...
ad
ad
wacko
ahysicis
You can do this by iterating over the string three times in parallel. Each iteration will be shifted relative to the next by one character. The middle one is the one that will provide the valid letters, the other two let us check if adjacent characters are stars. The two flanking iterators require dummy values to represent "before the start" and "after the end" of the string. There are a variety of ways to set that up, I'm using itertools.chain (and .islice) to fill in None for the dummy values. But you could use plain string and iterator manipulation if you prefer (i.e. iter('x' + string) and iter(string[1:] + 'x')):
import itertools
def star_kill(string):
    main_iterator = iter(string)
    look_behind = itertools.chain([None], string)
    look_ahead = itertools.chain(itertools.islice(string, 1, None), [None])
    return "".join(a for a, b, c in zip(main_iterator, look_behind, look_ahead)
                   if a != '*' and b != '*' and c != '*')
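The plain-string alternative mentioned above (padding with a dummy character instead of using itertools) might look like the sketch below. The name star_kill_padded is hypothetical, and the sketch assumes the target is a single character other than the padding character 'x'.

```python
def star_kill_padded(string, target='*'):
    # Pad with a character that is not the target so both ends have a neighbour.
    # Assumption: target is a single character other than 'x'.
    behind = 'x' + string      # behind[i] is the character before string[i]
    ahead = string[1:] + 'x'   # ahead[i] is the character after string[i]
    return ''.join(
        cur for cur, prev, nxt in zip(string, behind, ahead)
        if target not in (cur, prev, nxt)
    )


print(star_kill_padded('ab*cd'))     # ad
print(star_kill_padded('wacy*xko'))  # wacko
```

zip stops at the shortest input, so the extra character at the end of behind is never read.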
Not sure whether or not it's "Pythonic," but the problem can be solved with regular expressions.
import re
def starkill(s):
    s = re.sub(".{0,1}\\*{1,}.{0,1}", "", s)
    return s
For those not familiar with regex, I'll break that long string down:
Prefix
".{0,1}"
This specifies we want the replaced section to begin with either 0 or 1 of any character. If there is a character before the star, we want to replace it; otherwise, we still want the expression to hit if the star is at the very beginning of the input string.
Star
"\\*{1,}"
This specifies that the middle of the expression must contain an asterisk character, but it can also contain more than one. For instance, "a****b" will still hit, even though there are four stars. We need a backslash before the asterisk because regex has asterisk as a reserved character, and we need a second backslash before that because Python strings reserve the backslash character.
Suffix
.{0,1}
Same as the prefix. The expression can either end with one or zero of any character.
Hope that helps!
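For comparison, the same three-part pattern can be written more compactly with the shorthand quantifiers and a raw string, which avoids the doubled backslash. This is only a sketch; starkill_compact and STAR_RUN are illustrative names, not part of the answer above.

```python
import re

# Same three parts as the breakdown above, with shorthand quantifiers:
#   .?   zero or one of any character    (same as .{0,1})
#   \*+  one or more literal asterisks   (same as \*{1,})
STAR_RUN = re.compile(r'.?\*+.?')


def starkill_compact(s):
    return STAR_RUN.sub('', s)


print(starkill_compact('ab*cd'))   # ad
print(starkill_compact('a****b'))  # prints an empty line
```

Note that . does not match a newline by default, so a star next to a line break would leave the newline in place.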

partition a string by dash (-) python

I want to get a string and divide it into parts separated by "-".
Input:
aabbcc
And output:
aa-bb-cc
is there a way to do so?
If you want to do it based on the same letter then you can use itertools.groupby() to do this, e.g.:
In []:
import itertools as it
s = 'aabbcc'
'-'.join(''.join(g) for k, g in it.groupby(s))
Out[]:
'aa-bb-cc'
Or if you want it in chunks of 2 you can use iter() and zip():
In []:
n = 2
'-'.join(''.join(p) for p in zip(*[iter(s)]*n))
Out[]:
'aa-bb-cc'
Note: if the string length is not divisible by 2, this will drop the last character. You can replace zip(...) with itertools.zip_longest(..., fillvalue=''), but it is unclear whether the OP has this issue.
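A minimal sketch of the zip_longest variant from the note, written for Python 3; the function name dash_chunks is an assumption made for this example.

```python
from itertools import zip_longest


def dash_chunks(s, n=2):
    # n references to the same iterator, so each tuple takes n consecutive
    # characters; fillvalue='' pads the final, shorter chunk.
    chunks = zip_longest(*[iter(s)] * n, fillvalue='')
    return '-'.join(''.join(chunk) for chunk in chunks)


print(dash_chunks('aabbcc'))  # aa-bb-cc
print(dash_chunks('aabbc'))   # aa-bb-c
```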
If you consider creating pair-divided by a dash, you can use the below function:
def pair_div(string):
    newString=str() #for storing the divided string
    for i,s in enumerate(string):
        if i%2!=0 and i<(len(string)-1): #we make sure the function divides every two chars but not the last character of string.
            newString+=s+'-' #If it is the second member of pair, add a dash after it
        else:
            newString+=s #If not, just add the character
    return(newString)
And for example:
[In]:string="aazzxxcceewwqqbbvvaa"
[Out]:'aa-zz-xx-cc-ee-ww-qq-bb-vv-aa'
But if you want to group identical characters together and separate the groups with a dash, you are better off using regex methods.
BR,
Shend
You can try
data = "aabbcc"
"-".join([data[x:x+2] for x in range(0, len(data), 2)])
if you want to divide the string into block of 2 characters, then this will help you.
import textwrap
s='aabbcc'
lst=textwrap.wrap(s,2)
print('-'.join(lst))
The second argument sets the number of characters you want in each group.
s = 'aabbccdd'
#index 01234567
new_s = ''
1)
for idx, char in enumerate(s):
    new_s+=char
    if idx%2 != 0:
        new_s += '-'
print(new_s.strip('-'))
# aa-bb-cc-dd
2)
new_s = ''.join([s[i]+'-' if i%2 != 0 else s[i] for i in range(len(s))]).strip('-')
print(new_s)
# aa-bb-cc-dd

Remove punctuation items from end of string

I have a seemingly simple problem, which I cannot seem to solve. Given a string containing a DOI, I need to keep removing the last character while it is a punctuation mark, until the last character is a letter or number.
For example, if the string was:
sampleDoi = "10.1097/JHM-D-18-00044.',"
I want the following output:
"10.1097/JHM-D-18-00044"
ie. remove .',
I wrote the following script to do this:
invalidChars = set(string.punctuation.replace("_", ""))
a = "10.1097/JHM-D-18-00044.',"
i = -1
for each in reversed(a):
    if any(char in invalidChars for char in each):
        a = a[:i]
        i = i - 1
    else:
        print (a)
        break
However, this produces 10.1097/JHM-D-18-00 but I would like it to produce 10.1097/JHM-D-18-00044. Why is the 44 removed from the end?
The string function rstrip() is designed to do exactly this:
>>> sampleDoi = "10.1097/JHM-D-18-00044.',"
>>> sampleDoi.rstrip(",.'")
'10.1097/JHM-D-18-00044'
Corrected code:
import string
invalidChars = set(string.punctuation.replace("_", ""))
a = "10.1097/JHM-D-18-00044.',"
i = -1
for each in reversed(a):
    if any(char in invalidChars for char in each):
        a = a[:i]  # i stays at -1, so exactly one character is dropped per step
    else:
        print (a)
        break
This gives the output you want, while keeping the original code mostly the same.
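To see why the original loop in the question removes the trailing 44, it may help to record the string after each step. The sketch below reproduces the question's logic and adds a trace list purely for illustration.

```python
import string

invalid_chars = set(string.punctuation.replace("_", ""))
a = "10.1097/JHM-D-18-00044.',"
i = -1
trace = []
# reversed(a) keeps iterating over the ORIGINAL string object, even though
# the name `a` is rebound to shorter and shorter strings inside the loop.
for ch in reversed(a):
    if ch not in invalid_chars:
        break
    a = a[:i]  # drops 1 character, then 2, then 3
    i -= 1
    trace.append(a)

for step in trace:
    print(step)
# 10.1097/JHM-D-18-00044.'
# 10.1097/JHM-D-18-00044
# 10.1097/JHM-D-18-00
```

Three punctuation characters are seen, but 1 + 2 + 3 = 6 characters are removed, because the slice index keeps growing while the string is already getting shorter.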
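This is one way using next and str.isalnum with a generator expression utilizing enumerate / reversed.
sampleDoi = "10.1097/JHM-D-18-00044.',"
idx = next((i for i, j in enumerate(reversed(sampleDoi)) if j.isalnum()), len(sampleDoi))
res = sampleDoi[:len(sampleDoi) - idx]
print(res)
10.1097/JHM-D-18-00044
The default len(sampleDoi) is used so that, if no alphanumeric character is found, an empty string is returned. Slicing with len(sampleDoi) - idx rather than -idx also keeps the string intact when it already ends in an alphanumeric character (idx == 0), since sampleDoi[:-0] would be empty.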
If you don't want to use regex:
import string
the_str = "10.1097/JHM-D-18-00044.',"
while the_str and the_str[-1] in string.punctuation:
    the_str = the_str[:-1]
Removes the last character until it's no longer a punctuation character (the extra the_str check stops the loop if the string becomes empty).

Count spaces in text (treat consecutive spaces as one)

How would you count the number of spaces or newline characters in a text in such a way that consecutive spaces are counted only as one?
For example, this is very close to what I want:
string = "This is an example text.\n But would be good if it worked."
counter = 0
for i in string:
    if i == ' ' or i == '\n':
        counter += 1
print(counter)
However, instead of returning with 15, the result should be only 11.
The default str.split() function will treat consecutive runs of spaces as one. So simply split the string, get the size of the resulting list, and subtract one.
len(string.split())-1
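One caveat with the split approach is that it counts only the gaps between words, so leading and trailing whitespace runs are ignored and an empty string would give -1. A small sketch, with count_gaps as a hypothetical name:

```python
def count_gaps(text):
    # Number of whitespace runs that sit between two words.
    # max() guards the empty case, where len([]) - 1 would be -1.
    return max(len(text.split()) - 1, 0)


print(count_gaps("a  b\n c"))  # 2
print(count_gaps("  a b  "))   # 1  (leading/trailing runs are not counted)
print(count_gaps(""))          # 0
```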
Assuming you are permitted to use Python regex:
import re
print(len(re.findall(r"[ \n]+", string)))
Quick and easy!
UPDATE: Additionally, use [\s] instead of [ \n] to match any whitespace character.
You can do this:
string = "This is an example text.\n But would be good if it worked."
counter = 0
# A boolean flag indicating whether the previous character was a space
previous = False
for i in string:
    if i == ' ' or i == '\n':
        # The current character is a space
        previous = True # Setup for the next iteration
    else:
        # The current character is not a space, check if the previous one was
        if previous:
            counter += 1
        previous = False
print(counter)
re to the rescue.
>>> import re
>>> string = "This is an example text.\n But would be good if it worked."
>>> spaces = sum(1 for match in re.finditer('\s+', string))
>>> spaces
11
This consumes minimal memory, an alternative solution that builds a temporary list would be
>>> len(re.findall('\s+', string))
11
If you only want to consider space characters and newline characters (as opposed to tabs, for example), use the regex '(\n| )+' instead of '\s+'.
Just store a character that was the last character found. Set it to i each time you loop. Then within your inner if, do not increase the counter if the last character found was also a whitespace character.
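The idea of remembering the last character could be sketched like this. The function name count_whitespace_runs and the whitespace parameter are assumptions for the example; the whitespace set is a tuple so that the initial empty "last" value is not treated as whitespace.

```python
def count_whitespace_runs(text, whitespace=(' ', '\n')):
    count = 0
    last = ''  # last character seen; '' before the first one
    for ch in text:
        # Count only the first character of each whitespace run.
        if ch in whitespace and last not in whitespace:
            count += 1
        last = ch
    return count


print(count_whitespace_runs("a  b\n c"))  # 2
print(count_whitespace_runs(" a b "))     # 3
```

Unlike the split-based count, this version also counts runs at the very start or end of the text.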
You can iterate over indices instead of characters.
counter = 0
for i in range(1, len(string)):
    if string[i] in ' \n' and string[i-1] not in ' \n':
        counter += 1
if string[0] in ' \n':
    counter += 1
print(counter)
The first character is checked separately because the loop starts at the second character, so that string[i-1] never wraps around to the last character.
You can use enumerate, checking the next char is not also whitespace so consecutive whitespace will only count as 1:
string = "This is an example text.\n But would be good if it worked."
print(sum(ch.isspace() and not string[i:i+1].isspace() for i, ch in enumerate(string, 1)))
You can also use iter with a generator function, keeping track of the last character and comparing:
def con(s):
    it = iter(s)
    prev = next(it)
    for ele in it:
        yield prev.isspace() and not ele.isspace()
        prev = ele
    yield ele.isspace()
print(sum(con(string)))
An itertools version:
string = "This is an example text.\n But would be good if it worked. "
from itertools import tee, izip_longest
a, b = tee(string)
next(b)
print(sum(a.isspace() and not b.isspace() for a,b in izip_longest(a,b, fillvalue="") ))
Try:
def word_count(my_string):
    word_count = 1
    for i in range(1, len(my_string)):
        if my_string[i] == " ":
            if not my_string[i - 1] == " ":
                word_count += 1
    return word_count
You can use the function groupby() to find groups of consecutive spaces:
from collections import Counter
from itertools import groupby
s = 'This is an example text.\n But would be good if it worked.'
c = Counter(k for k, _ in groupby(s, key=lambda x: ' ' if x == '\n' else x))
print(c[' '])
# 11

String manipulation weirdness when incrementing trailing digit

I got this code:
myString = 'blabla123_01_version6688_01_01Long_stringWithNumbers'
versionSplit = re.findall(r'-?\d+|[a-zA-Z!##$%^&*()_+.,<>{}]+|\W+?', myString)
for i in reversed(versionSplit):
    id = versionSplit.index(i)
    if i.isdigit():
        digit = '%0'+str(len(i))+'d'
        i = int(i) + 1
        i = digit % i
        versionSplit[id]=str(i)
        break
final = ''
myString = final.join(versionSplit)
print myString
This is supposed to increase ONLY the last number in the given string. But if you run the code, you will see that if the same number as the last one appears elsewhere in the string, those occurrences get increased one after the other as you keep running the script. Can anyone help me find out why?
Thank you in advance for any help
Is there a reason why you aren't doing something like this instead:
prefix, version = re.match(r"(.*[^\d]+)([\d]+)$", myString).groups()
newstring = prefix + str(int(version)+1).rjust(len(version), '0')
Notes:
This will actually "carry over" the version numbers properly: ("09" -> "10") and ("99" -> "100")
This regex assumes at least one non-numeric character before the final version substring at the end. If this is not matched, it will throw an AttributeError. You could restructure it to throw a more suitable or specific exception (e.g. if re.match(...) returns None; see comments below for more info).
Adjust accordingly.
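One way to make the failure mode from the notes explicit is to check for None before calling groups(). This is a minimal sketch: bump_trailing_number is a hypothetical helper name, and raising ValueError is an assumption, since the answer does not say which exception would be more suitable.

```python
import re


def bump_trailing_number(text):
    # Hypothetical helper: increments the digits at the very end of text.
    match = re.match(r"(.*[^\d]+)([\d]+)$", text)
    if match is None:
        # Assumed choice of exception; the original would hit AttributeError.
        raise ValueError("no trailing version number in %r" % (text,))
    prefix, version = match.groups()
    return prefix + str(int(version) + 1).rjust(len(version), '0')


print(bump_trailing_number('file_09'))  # file_10
print(bump_trailing_number('file_99'))  # file_100
```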
The issue is the use of the list.index() function on line 5. This returns the index of the first occurrence of a value in a list, from left to right, but the code is iterating over the reversed list (right to left). There are lots of ways to straighten this out, but here's one that makes the fewest changes to your existing code: Iterate over indices in reverse (avoids reversing the list).
for idx in range(len(versionSplit)-1, -1, -1):
    i = versionSplit[idx]
    if i.isdigit():
        digit = '%0'+str(len(i))+'d'
        i = int(i) + 1
        i = digit % i
        versionSplit[idx]=str(i)
        break
myString = 'blabla123_01_version6688_01_01veryLong_stringWithNumbers01'
versionSplit = re.findall(r'-?\d+|[^\-\d]+', myString)
for i in xrange(len(versionSplit) - 1, -1, -1):
    s = versionSplit[i]
    if s.isdigit():
        n = int(s) + 1
        versionSplit[i] = "%0*d" % (len(s), n)
        break
myString = ''.join(versionSplit)
print myString
Notes:
It is silly to use the .index() method to try to find the string. Just use a decrementing index to try each part of versionSplit. This was where your problem was, as commented above by @David Robinson.
Don't use id as a variable name; you are covering up the built-in function id().
This code is using the * in a format template, which will accept an integer and set the width.
I simplified the pattern: either you are matching a digit (with optional leading minus sign) or else you are matching non-digits.
I tested this and it seems to work.
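To illustrate the * width specifier mentioned in the notes, here is a small sketch; increment_padded is a hypothetical helper name, not part of the code above.

```python
def increment_padded(digits):
    # The * takes the field width from the argument tuple, so the result is
    # zero-padded to at least the original number of digits.
    return "%0*d" % (len(digits), int(digits) + 1)


print(increment_padded('01'))    # 02
print(increment_padded('09'))    # 10
print(increment_padded('99'))    # 100  (the width is a minimum, so the carry is kept)
print(increment_padded('0007'))  # 0008
```

str(n).zfill(width) produces the same padding for non-negative numbers if you prefer to avoid the % operator.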
First, three notes:
id is the name of a built-in Python function, so it is better not to reuse it as a variable name;
For joining, a more pythonic idiom is ''.join(), using a literal empty string
reversed() returns an iterator, not a list. That's why I use list(reversed()), in order to do rev.index(i) later.
Corrected code:
import re
myString = 'blabla123_01_version6688_01_01veryLong_stringWithNumbers01'
print myString
versionSplit = re.findall(r'-?\d+|[a-zA-Z!##$%^&*()_+.,<>{}]+|\W+?', myString)
rev = list(reversed(versionSplit)) # create a reversed list to work with from now on
for i in rev:
    idd = rev.index(i)
    if i.isdigit():
        digit = '%0'+str(len(i))+'d'
        i = int(i) + 1
        i = digit % i
        rev[idd]=str(i)
        break
myString = ''.join(reversed(rev)) # reverse again only just before joining
print myString
