Split string at nth occurrence of a given character - python

Is there a Python-way to split a string after the nth occurrence of a given delimiter?
Given a string:
'20_231_myString_234'
It should be split into (with the delimiter being '_', after its second occurrence):
['20_231', 'myString_234']
Or is the only way to accomplish this to count, split and join?

>>> n = 2
>>> groups = text.split('_')
>>> '_'.join(groups[:n]), '_'.join(groups[n:])
('20_231', 'myString_234')
Seems like this is the most readable way, the alternative is regex)

Using re to get a regex of the form ^((?:[^_]*_){n-1}[^_]*)_(.*) where n is a variable:
n=2
s='20_231_myString_234'
m=re.match(r'^((?:[^_]*_){%d}[^_]*)_(.*)' % (n-1), s)
if m: print m.groups()
or have a nice function:
import re
def nthofchar(s, c, n):
regex=r'^((?:[^%c]*%c){%d}[^%c]*)%c(.*)' % (c,c,n-1,c,c)
l = ()
m = re.match(regex, s)
if m: l = m.groups()
return l
s='20_231_myString_234'
print nthofchar(s, '_', 2)
Or without regexes, using iterative find:
def nth_split(s, delim, n):
p, c = -1, 0
while c < n:
p = s.index(delim, p + 1)
c += 1
return s[:p], s[p + 1:]
s1, s2 = nth_split('20_231_myString_234', '_', 2)
print s1, ":", s2

I like this solution because it works without any actuall regex and can easiely be adapted to another "nth" or delimiter.
import re
string = "20_231_myString_234"
occur = 2 # on which occourence you want to split
indices = [x.start() for x in re.finditer("_", string)]
part1 = string[0:indices[occur-1]]
part2 = string[indices[occur-1]+1:]
print (part1, ' ', part2)

I thought I would contribute my two cents. The second parameter to split() allows you to limit the split after a certain number of strings:
def split_at(s, delim, n):
r = s.split(delim, n)[n]
return s[:-len(r)-len(delim)], r
On my machine, the two good answers by #perreal, iterative find and regular expressions, actually measure 1.4 and 1.6 times slower (respectively) than this method.
It's worth noting that it can become even quicker if you don't need the initial bit. Then the code becomes:
def remove_head_parts(s, delim, n):
return s.split(delim, n)[n]
Not so sure about the naming, I admit, but it does the job. Somewhat surprisingly, it is 2 times faster than iterative find and 3 times faster than regular expressions.
I put up my testing script online. You are welcome to review and comment.

>>>import re
>>>str= '20_231_myString_234'
>>> occerence = [m.start() for m in re.finditer('_',str)] # this will give you a list of '_' position
>>>occerence
[2, 6, 15]
>>>result = [str[:occerence[1]],str[occerence[1]+1:]] # [str[:6],str[7:]]
>>>result
['20_231', 'myString_234']

It depends what is your pattern for this split. Because if first two elements are always numbers for example, you may build regular expression and use re module. It is able to split your string as well.

I had a larger string to split ever nth character, ended up with the following code:
# Split every 6 spaces
n = 6
sep = ' '
n_split_groups = []
groups = err_str.split(sep)
while len(groups):
n_split_groups.append(sep.join(groups[:n]))
groups = groups[n:]
print n_split_groups
Thanks #perreal!

In function form of #AllBlackt's solution
def split_nth(s, sep, n):
n_split_groups = []
groups = s.split(sep)
while len(groups):
n_split_groups.append(sep.join(groups[:n]))
groups = groups[n:]
return n_split_groups
s = "aaaaa bbbbb ccccc ddddd eeeeeee ffffffff"
print (split_nth(s, " ", 2))
['aaaaa bbbbb', 'ccccc ddddd', 'eeeeeee ffffffff']

As #Yuval has noted in his answer, and #jamylak commented in his answer, the split and rsplit methods accept a second (optional) parameter maxsplit to avoid making splits beyond what is necessary. Thus, I find the better solution (both for readability and performance) is this:
s = '20_231_myString_234'
first_part = text.rsplit('_', 2)[0] # Gives '20_231'
second_part = text.split('_', 2)[2] # Gives 'myString_234'
This is not only simple, but also avoids performance hits of regex solutions and other solutions using join to undo unnecessary splits.

Related

Find all permutation of a string in another string

I have a string, for example: string1 = 'abcdbcabdcabb'.
And I have another string, for example: string2 = 'cab'
I need to count all permutation of string2 in string1.
Currently I'm adding all permutation of string2 to a list,
and than iterating threw string1 by index+string.size and checking
if sub-string of string1 contain in the list of the permutations
I'm sure there is a better optimized way to do it.
You do not need DP in my mind, but a sliding window technic. A permutation of string2 is a string that has exactly the same length and the distribution of the characters is the same. In your example of string2, a permutation is. a string of length 3 with this distribution of characters: {a:1,b:1,c:1}.
So you can write a script, to consider a window of size N (size of string2), from the beginning of string1(index=0). if your current window has exactly the same distribution of characters, you accept it as a permutation, if not you do not count it, and you move on to index+1.
A trick for not recalculating the character distribution in each sliding window, you can get a dictionary of characters, and count the characters at the very first window, then when you slide the window one to the right, you decrease the removing character by one, and increase the adding character by 1.
The code should be something like this, you need to verify it for edge cases:
def get_permut(string1,string2):
N =len(string2)
M = len(string1)
if M < N:
return 0
valid_dist = dict()
for ch in string2:
valid_dist.setdefault(ch,0)
valid_dist[ch]+=1
current_dist=dict()
for ch in string1[:N]:
current_dist.setdefault(ch,0)
current_dist[ch]+=1
ct=0
for i in range(M-N):
if current_dist == valid_dist:
ct+=1
current_dist[i]-=1
current_dist.setdefault(i+1,0)
current_dist[i+1]+=1
if current_dist[i]==0:
del current_dist[i]
return ct
You can use string.count() method here. See below for some way to resolve it:
import itertools
perms=[''.join(i) for i in itertools.permutations(string2)]
res=0
for i in perms:
res+= string1.count(i)
print(res)
# 4
You can use regex for that.
def lst_permutation(text):
from itertools import permutations
lst_p=[]
for i in permutations(text):
lst_p.append(''.join(i))
return lst_p
def total_permutation(string1,string2):
import re
total=0
for i in lst_permutation(string2):
res=re.findall(string2,string1)
total += len(res)
return total
string1 = 'abcdbcabdcabb'
string2 = 'cab'
print(total_permutation(string1,string2))
#12
Here's a dumb way to do it with a regex (don't actually do this).
Use a non capturing group for each letter in the search text, and then expect one of each captured group to appear in the output:
import re
string1 = 'abcdbcabdcabb'
string2 = r'(?:c()|a()|b()){3}\1\2\3'
pos = 0
r = re.compile(string2)
while m := r.search(string1, pos=pos):
print(m.group())
pos = m.start() + 1
abc
bca
cab
cab
Can also dynamically generate it
import re
string1 = 'abcdbcabdcabb'
string2 = 'cab'
before = "|".join([f"{l}()" for l in string2])
matches = "".join([f"\\{i + 1}" for i in range(len(string2))])
r = re.compile(f"(?:{before}){{{len(string2)}}}{matches}")
pos = 0
while m := r.search(string1, pos=pos):
print(m.group())
pos = m.start() + 1

is there a more efficient way than this method?

The Job is to decompress a string.
For example:
if a string is 'a3b4c2' then decompress it as 'aaabbbbcc'.
the previous code i tried is
list1 = [i for i in a]
listNum = list(map(int,list(filter(lambda x:x.isdigit(),list1))))
listChar = list(filter(lambda x: not x.isdigit(),list1))
b = ''
for i in range(len(listNum)):
b += listChar[i]*listNum[i]
print(b)
I think it is a pretty simple problem, but my code seems clumsy, is there any other method to do it?.
import re
b = ''.join(c * int(n) for c, n in re.findall(r'(\w)(\d+)', a))
The regex will match each letter with the following number (accommodating multi-digit numbers) and return them in groups:
>>> re.findall(r'(\w)(\d+)', a)
[('a', '3'), ('b', '4'), ('c', '2')]
Then you just need to iterate over them…
for c, n in ...
# c = 'a'
# n = '3'
# ...
…and multiply them…
c * int(n)
…and simply do that in a generator expression…
c * int(n) for c, n in re.findall(r'(\w)(\d+)', a)
…and ''.join all the resulting small strings together.
For fun, here's a version that even allows standalone letters without numbers:
a = 'a3bc4d2e'
b = ''.join(c * int(n or 1) for c, n in re.findall(r'(\w)(\d+)?', a))
# aaabccccdde
Just another way, zip + splicing,
>>> value = 'a3b4c2'
>>>
>>> "".join(x*int(y) for x, y in zip(value[0::2], value[1::2]))
'aaabbbbcc'
>>>
You can use list comprehension for a one line solution:
input='a3b4c2'
result=''.join(input[i] * int(input[i+1]) for i in range(0,len(input),2))
Output:
>>> result
aaabbbbcc
The * operator can be used to multiply an integer with a character.
The join method is called to join the list of the substrings to the full string.
You might do it using regular expressions (re module), using grouping and function as 2nd re.sub argument following way
import re
a = 'a3b4c2'
def decompress(x):
return x.group(1)*int(x.group(2))
output = re.sub(r'(\D)(\d+)', decompress, a)
print(output) # aaabbbbcc
Explanation I am looking in string for single non-digit (\D) followed by one or more digits (\d+). For every match first is put into 1st group and latter into 2nd group, hence brackets in pattern. Then every match is replaced by content of 1st group (which is string) times value of content of 2nd group. Note that I used int to get that value as attempt of direct multiplying would fail (you can not multiply string by string).
Iterate the string pairwise using zip, to get the char c and int n as separate elements and then replicate the char c for n times
>>> str = 'a3b4c2'
>>> s = iter(str)
>>> ''.join(c*int(n) for c,n in zip(s, s))
'aaabbbbcc'

Is there an easy way to get the number of repeating character in a word?

I'm trying to get how many any character repeats in a word. The repetitions must be sequential.
For example, the method with input "loooooveee" should return 6 (4 times 'o', 2 times 'e').
I'm trying to implement string level functions and I can do it this way but, is there an easy way to do this? Regex, or some other sort of things?
Original question: order of repetition does not matter
You can subtract the number of unique letters by the number of total letters. set applied to a string will return a unique collection of letters.
x = "loooooveee"
res = len(x) - len(set(x)) # 6
Or you can use collections.Counter, subtract 1 from each value, then sum:
from collections import Counter
c = Counter("loooooveee")
res = sum(i-1 for i in c.values()) # 6
New question: repetitions must be sequential
You can use itertools.groupby to group sequential identical characters:
from itertools import groupby
g = groupby("aooooaooaoo")
res = sum(sum(1 for _ in j) - 1 for i, j in g) # 5
To avoid the nested sum calls, you can use itertools.islice:
from itertools import groupby, islice
g = groupby("aooooaooaoo")
res = sum(1 for _, j in g for _ in islice(j, 1, None)) # 5
You could use a regular expression if you want:
import re
rx = re.compile(r'(\w)\1+')
repeating = sum(x[1] - x[0] - 1
for m in rx.finditer("loooooveee")
for x in [m.span()])
print(repeating)
This correctly yields 6 and makes use of the .span() function.
The expression is
(\w)\1+
which captures a word character (one of a-zA-Z0-9_) and tries to repeat it as often as possible.
See a demo on regex101.com for the repeating pattern.
If you want to match any character (that is, not only word characters), change your expression to:
(.)\1+
See another demo on regex101.com.
try this:
word=input('something:')
sum = 0
chars=set(list(word)) #get the set of unique characters
for item in chars: #iterate over the set and output the count for each item
if word.count(char)>1:
sum+=word.count(char)
print('{}|{}'.format(item,str(word.count(char)))
print('Total:'+str(sum))
EDIT:
added total count of repetitions
Since it doesn't matter where the repetition is occurring or which characters are being repeated, you can make use of the set data structure provided in Python. It will discard the duplicate occurrences of any character or an object.
Therefore, the solution would look something like this:
def measure_normalized_emphasis(text):
return len(text) - len(set(text))
This will give you the exact result.
Also, make sure to look out for some edge cases, which you should as it is a good practice.
I think your code is comparing the wrong things
You start by finding the last character:
char = text[-1]
Then you compare this to itself:
for i in range(1, len(text)):
if text[-i] == char: #<-- surely this is test[-1] to begin with?
Why not just run through the characters:
def measure_normalized_emphasis(text):
char = text[0]
emphasis_size = 0
for i in range(1, len(text)):
if text[i] == char:
emphasis_size += 1
else:
char = text[i]
return emphasis_size
This seems to work.

Finding regular expression with at least one repetition of each letter

From any *.fasta DNA sequence (only 'ACTG' characters) I must find all sequences which contain at least one repetition of each letter.
For examle from sequence 'AAGTCCTAG' I should be able to find: 'AAGTC', 'AGTC', 'GTCCTA', 'TCCTAG', 'CCTAG' and 'CTAG' (iteration on each letter).
I have no clue how to do that in pyhton 2.7. I was trying with regular expressions but it was not searching for every variants.
How can I achive that?
You could find all substrings of length 4+, and then down select from those to find only the shortest possible combinations that contain one of each letter:
s = 'AAGTCCTAG'
def get_shortest(s):
l, b = len(s), set('ATCG')
options = [s[i:j+1] for i in range(l) for j in range(i,l) if (j+1)-i > 3]
return [i for i in options if len(set(i) & b) == 4 and (set(i) != set(i[:-1]))]
print(get_shortest(s))
Output:
['AAGTC', 'AGTC', 'GTCCTA', 'TCCTAG', 'CCTAG', 'CTAG']
This is another way you can do it. Maybe not as fast and nice as chrisz answere. But maybe a little simpler to read and understand for beginners.
DNA='AAGTCCTAG'
toSave=[]
for i in range(len(DNA)):
letters=['A','G','T','C']
j=i
seq=[]
while len(letters)>0 and j<(len(DNA)):
seq.append(DNA[j])
try:
letters.remove(DNA[j])
except:
pass
j+=1
if len(letters)==0:
toSave.append(seq)
print(toSave)
Since the substring you are looking for may be of about any length, a LIFO queue seems to work. Append each letter at a time, check if there are at least one of each letters. If found return it. Then remove letters at the front and keep checking until no longer valid.
def find_agtc_seq(seq_in):
chars = 'AGTC'
cur_str = []
for ch in seq_in:
cur_str.append(ch)
while all(map(cur_str.count,chars)):
yield("".join(cur_str))
cur_str.pop(0)
seq = 'AAGTCCTAG'
for substr in find_agtc_seq(seq):
print(substr)
That seems to result in the substrings you are looking for:
AAGTC
AGTC
GTCCTA
TCCTAG
CCTAG
CTAG
I really wanted to create a short answer for this, so this is what I came up with!
See code in use here
s = 'AAGTCCTAG'
d = 'ACGT'
c = len(d)
while c <= len(s):
x,c = s[:c],c+1
if all(l in x for l in d):
print(x)
s,c = s[1:],len(d)
It works as follows:
c is set to the length of the string of characters we are ensuring exist in the string (d = ACGT)
The while loop iterates over each possible substring of s such that c is smaller than the length of s.
This works by increasing c by 1 upon each iteration of the while loop.
If every character in our string d (ACGT) exist in the substring, we print the result, reset c to its default value and slice the string by 1 character from the start.
The loop continues until the string s is shorter than d
Result:
AAGTC
AGTC
GTCCTA
TCCTAG
CCTAG
CTAG
To get the output in a list instead (see code in use here):
s = 'AAGTCCTAG'
d = 'ACGT'
c,r = len(d),[]
while c <= len(s):
x,c = s[:c],c+1
if all(l in x for l in d):
r.append(x)
s,c = s[1:],len(d)
print(r)
Result:
['AAGTC', 'AGTC', 'GTCCTA', 'TCCTAG', 'CCTAG', 'CTAG']
If you can break the sequence into a list, e.g. of 5-letter sequences, you could then use this function to find repeated sequences.
from itertools import groupby
import numpy as np
def find_repeats(input_list, n_repeats):
flagged_items = []
for item in input_list:
# Create itertools.groupby object
groups = groupby(str(item))
# Create list of tuples: (digit, number of repeats)
result = [(label, sum(1 for _ in group)) for label, group in groups]
# Extract just number of repeats
char_lens = np.array([x[1] for x in result])
# Append to flagged items
if any(char_lens >= n_repeats):
flagged_items.append(item)
# Return flagged items
return flagged_items
#--------------------------------------
test_list = ['aatcg', 'ctagg', 'catcg']
find_repeats(test_list, n_repeats=2) # Returns ['aatcg', 'ctagg']

Removing a character in a string one at a time

Basically I want to remove a character in a string one at a time if it occurs multiple times .
For eg :- if I have a word abaccea and character 'a' then the output of the function should be baccea , abacce , abccea.
I read that I can make maketrans for a and empty string but it replaces every a in the string.
Is there an efficient way to do this besides noting all the positions in a list and then replacing and generating the words ??
Here is a quick way of doing it:
In [6]: s = "abaccea"
In [9]: [s[:key] + s[key+1:] for key,val in enumerate(s) if val == "a"]
Out[10]: ['baccea', 'abccea', 'abacce']
There is the benefit of being able to turn this into a generator by simpling replacing square brackets with round ones.
You could try the following script. It provides a simple function to do what you ask. The use of list comprehensions [x for x in y if something(x)] is well worth learning.
#!/usr/bin/python
word = "abaccea"
letter = "a"
def single_remove(word, letter):
"""Remove character c from text t one at a time
"""
indexes = [c for c in xrange(len(word)) if word[c] == letter]
return [word[:i] + word[i + 1:] for i in indexes]
print single_remove(word, letter)
returns ['baccea', 'abccea', 'abacce']
Cheers
I'd say that your approach sounds good - it is a reasonably efficient way to do it and it will be clear to the reader what you are doing.
However a slightly less elegant but possibly faster alternative is to use the start parameter of the find function.
i = 0
while True:
j = word.find('a', i)
if j == -1:
break
print word[:j] + word[j+1:]
i = j + 1
The find function is likely to be highly optimized in C, so this may give you a performance improvement compared to iterating over the characters in the string yourself in Python. Whether you want to do this though depends on whether you are looking for efficiency or elegance. I'd recommend going for the simple and clear approach first, and only optimizing it if performance profiling shows that efficiency is an important issue.
Here are some performance measurements showing that the code using find can run faster:
>>> method1='[s[:key] + s[key+1:] for key,val in enumerate(s) if val == "a"]'
>>> method2='''
result=[]
i = 0
while True:
j = s.find('a', i)
if j == -1:
break
result.append(s[:j] + s[j+1:])
i = j + 1
'''
>>> timeit.timeit(method1, init, number=100000)
2.5391986271997666
>>> timeit.timeit(method2, init, number=100000)
1.1471052885212885
how about this ?
>>> def replace_a(word):
... word = word[1:8]
... return word
...
>>> replace_a("abaccea")
'baccea'
>>>

Categories