Remove N consecutive repeated characters in a string - python

I am trying to solve a problem where the user inputs a string, say str = "aaabbcc", and an integer n = 2.
The function is supposed to remove characters that appear 'n' times consecutively from str and output only "aaa".
I tried a couple of approaches, but I'm not able to obtain the right output.
Are there any regular expression functions I could use, or recursive functions, or just plain old iteration?
Thanks in advance.

Using itertools.groupby
Ex:
from itertools import groupby
s = "aaabbcc"
n = 2
result = ""
for k, v in groupby(s):
    value = list(v)
    if not len(value) == n:
        result += "".join(value)
print(result)
Output:
aaa

You can use itertools.groupby:
>>> s = "aaabbccddddddddddeeeee"
>>> from itertools import groupby
>>> n = 3
>>> groups = (list(values) for _, values in groupby(s))
>>> "".join("".join(v) for v in groups if len(v) < n)
'bbcc'

from collections import Counter
counts = Counter(string)
string = "".join(c for c in string if counts[c] != 2)
Edit: Wait, sorry, I missed "consecutive". This will remove characters that occur exactly two times in the whole string (fitting your example, but not the general case).
Consecutive filter is a bit more complex, but doable - just find the consecutive runs first, then filter out the ones which have length two.
runs = [[string[0], 0]]
for c in string:
    if c == runs[-1][0]:
        runs[-1][1] += 1
    else:
        runs.append([c, 1])
string = "".join(c*length for c,length in runs if length != 2)
Edit2: As the other answers correctly point out, the first part of this is done natively by groupby
from itertools import groupby
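# groupby yields (character, group iterator) pairs, so measure each group first
runs = ((c, len(list(group))) for c, group in groupby(string))
string = "".join(c*length for c,length in runs if length != 2)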
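Since the question also asks about regular expressions, here is a minimal sketch of that route. It assumes "remove" means dropping every run whose length is exactly n; the function name drop_runs_of_length is made up for illustration and is not from any answer here.

```python
import re

def drop_runs_of_length(text, n):
    """Remove every run of identical characters whose length is exactly n.

    Hypothetical helper for illustration only.
    """
    # (.)\1* matches one maximal run of a single repeated character.
    # re.DOTALL lets "." match newlines as well.
    return re.sub(
        r"(.)\1*",
        lambda m: "" if len(m.group()) == n else m.group(),
        text,
        flags=re.DOTALL,
    )

print(drop_runs_of_length("aaabbcc", 2))  # aaa
```

Because the pattern is greedy, each match covers a whole run, so the length check in the replacement function sees the full run rather than a piece of it.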

In [15]: some_string = 'aaabbcc'
In [16]: n = 2
In [17]: final_string = ''
In [18]: for k, v in Counter(some_string).items():
    ...:     if v != n:
    ...:         final_string += k * v
    ...:
In [19]: final_string
Out[19]: 'aaa'
You'll need: from collections import Counter

from collections import defaultdict
def fun(string,n):
    dic = defaultdict(int)
    for i in string:
        dic[i]+=1
    check = []
    for i in dic:
        if dic[i]==n:
            check.append(i)
    for i in check:
        del dic[i]
    return dic
string = "aaabbcc"
n = 2
result = fun(string, n)
sol =''
for i in result:
    sol+=i*result[i]
print(sol)
output
aaa
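One thing worth checking: the counting-based answers above tally each character across the whole string, which happens to match the example but is not the same as looking at consecutive runs. A small sketch of the difference, with made-up function names and a made-up sample string:

```python
from collections import Counter
from itertools import groupby

def drop_by_total_count(text, n):
    # Counts each character over the whole string (not consecutive runs).
    counts = Counter(text)
    return "".join(ch for ch in text if counts[ch] != n)

def drop_by_run_length(text, n):
    # Looks at each consecutive run separately.
    runs = ("".join(group) for _, group in groupby(text))
    return "".join(run for run in runs if len(run) != n)

sample = "aabaa"  # hypothetical input: two separate runs of "aa"
print(drop_by_total_count(sample, 2))  # aabaa
print(drop_by_run_length(sample, 2))   # b
```

On "aaabbcc" both functions return "aaa", which is why the two interpretations are easy to confuse.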

Related

Getting all possible combinations of a regular expression string

I have a regular expression: ATG(C|G|A)(C|T)GA
The above regular expression could take any form, with only OR (|) special characters at any position in the string and any number of letters within the parentheses.
I want to match all combinations of this string in a list:
ATGCCGA
ATGCTGA
ATGGCGA
ATGGTGA
ATGACGA
ATGATGA
I am unable to find any python library that could do this.
You could take the cartesian product of the dynamic parts of the string using itertools.product then join with the other static parts of the string.
>>> from itertools import product
>>> [f'ATG{i}{j}GA' for i,j in product('CGA', 'CT')]
['ATGCCGA', 'ATGCTGA', 'ATGGCGA', 'ATGGTGA', 'ATGACGA', 'ATGATGA']
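If the pattern is not known in advance, the same product idea can be driven by the pattern string itself. This is only a sketch, assuming the groups are never nested and contain nothing but | alternatives; the function name expand is made up for illustration.

```python
import re
from itertools import product

def expand(pattern):
    """Expand a pattern such as 'ATG(C|G|A)(C|T)GA' into all its strings.

    Hypothetical helper; assumes non-nested groups with only | inside.
    """
    # re.split with a capturing group keeps the captured text, so literal
    # pieces land at even indices and group contents at odd indices.
    parts = re.split(r"\(([^()]*)\)", pattern)
    choices = [
        part.split("|") if index % 2 else [part]
        for index, part in enumerate(parts)
    ]
    return ["".join(combo) for combo in product(*choices)]

print(expand("ATG(C|G|A)(C|T)GA"))
# ['ATGCCGA', 'ATGCTGA', 'ATGGCGA', 'ATGGTGA', 'ATGACGA', 'ATGATGA']
```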
You can use recursion:
import collections
s = 'ATG(C|G|A)(C|T)GA'
def combos(d):
    r, k = [], None
    while d:
        if (c:=d.popleft()) not in '|()':
            k = (k if k else '')+c
        elif c == '|':
            if k:
                r.append(k)
            k = None
        elif c == '(':
            r = [v+(k or '')+i for i in combos(d) for v in (r if r else [''])]
            k = None
        else:
            if k:
                r.append(k)
            k = None
            break
    yield from ([i+(k or '') for i in r] if r else [k])
print(list(combos(collections.deque(list(s)))))
Output:
['ATGCCGA', 'ATGGCGA', 'ATGACGA', 'ATGCTGA', 'ATGGTGA', 'ATGATGA']

Python - removing repeated letters in a string

Say I have a string in alphabetical order, based on the amount of times that a letter repeats.
Example: "BBBAADDC".
There are 3 B's, so they go at the start, 2 A's and 2 D's, so the A's go in front of the D's because they are in alphabetical order, and 1 C. Another example would be CCCCAAABBDDAB.
Note that there can be 4 letters in the middle somewhere (i.e. CCCC), as there could be 2 pairs of 2 letters.
However, let's say I can only have n letters in a row. For example, if n = 3 in the second example, then I would have to omit one "C" from the first substring of 4 C's, because there can only be a maximum of 3 of the same letters in a row.
Another example would be the string "CCCDDDAABC"; if n = 2, I would have to remove one C and one D to get the string CCDDAABC
Example input/output:
n=2: Input: AAABBCCCCDE, Output: AABBCCDE
n=4: Input: EEEEEFFFFGGG, Output: EEEEFFFFGGG
n=1: Input: XXYYZZ, Output: XYZ
How can I do this with Python? Thanks in advance!
This is what I have right now, although I'm not sure if it's on the right track. Here, z is the length of the string.
for k in range(z+1):
    if final_string[k] == final_string[k+1] == final_string[k+2] == final_string[k+3]:
        final_string = final_string.translate({ord(final_string[k]): None})
return final_string
Ok, based on your comment, you're either pre-sorting the string or it doesn't need to be sorted by the function you're trying to create. You can do this more easily with itertools.groupby():
import itertools
def max_seq(text, n=1):
    result = []
    for k, g in itertools.groupby(text):
        result.extend(list(g)[:n])
    return ''.join(result)
max_seq('AAABBCCCCDE', 2)
# 'AABBCCDE'
max_seq('EEEEEFFFFGGG', 4)
# 'EEEEFFFFGGG'
max_seq('XXYYZZ')
# 'XYZ'
max_seq('CCCDDDAABC', 2)
# 'CCDDAABC'
Each group g is expanded into a list and then sliced to its first n elements (the [:n] part), so you get each letter at most n times in a row. If the same letter appears elsewhere, it's treated as an independent sequence when counting n in a row.
Edit: Here's a shorter version, which may also perform better for very long strings. And while we're using itertools, this one additionally utilises itertools.chain.from_iterable() to create the flattened list of letters. And since each of these is a generator, it's only evaluated/expanded at the last line:
import itertools
def max_seq(text, n=1):
    sequences = (list(g)[:n] for _, g in itertools.groupby(text))
    letters = itertools.chain.from_iterable(sequences)
    return ''.join(letters)
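For comparison, the same capping can also be sketched with a regular expression instead of groupby. This is an alternative technique, not part of the answer above, and the name max_seq_regex is made up for illustration; it assumes n is at least 1.

```python
import re

def max_seq_regex(text, n=1):
    """Cap every run of identical characters at n (hypothetical helper)."""
    # (.)\1{n,} matches a character followed by at least n more copies,
    # i.e. a run longer than n; it is replaced by exactly n copies.
    pattern = re.compile(r"(.)\1{%d,}" % n, flags=re.DOTALL)
    return pattern.sub(lambda m: m.group(1) * n, text)

print(max_seq_regex('AAABBCCCCDE', 2))   # AABBCCDE
print(max_seq_regex('EEEEEFFFFGGG', 4))  # EEEEFFFFGGG
print(max_seq_regex('XXYYZZ'))           # XYZ
```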
hello = "hello frrriend"
def replacing() -> str:
    global hello
    j = 0
    for i in hello:
        if j == 0:
            pass
        else:
            if i == prev:
                hello = hello.replace(i, "")
                prev = i
        prev = i
        j += 1
    return hello
replacing()
It looks a bit primitive, and be aware that str.replace removes every occurrence of a repeated letter rather than trimming the run to n (the call above returns 'heo fiend'). That's what I came up with on the go anyway; hope it helps :D
Here's my solution:
def snip_string(string, n):
    list_string = list(string)
    list_string.sort()
    chars = set(string)
    for char in chars:
        while list_string.count(char) > n:
            list_string.remove(char)
    return ''.join(list_string)
Calling the function with various values for n gives the following output:
>>> string = "AAAABBBCCCDDD"
>>> snip_string(string, 1)
'ABCD'
>>> snip_string(string, 2)
'AABBCCDD'
>>> snip_string(string, 3)
'AAABBBCCCDDD'
>>>
Edit
Here is the updated version of my solution, which only removes characters if the group of repeated characters exceeds n.
import itertools
def snip_string(string, n):
    groups = [list(g) for k, g in itertools.groupby(string)]
    string_list = []
    for group in groups:
        while len(group) > n:
            del group[-1]
        string_list.extend(group)
    return ''.join(string_list)
Output:
>>> string = "DDDAABBBBCCABCDE"
>>> snip_string(string, 3)
'DDDAABBBCCABCDE'
from itertools import groupby
n = 2
def rem(string):
    out = "".join(["".join(list(g)[:n]) for _, g in groupby(string)])
    print(out)
That is the entire code for your question. I tested it with the following strings:
s = "AABBCCDDEEE"
s2 = "AAAABBBDDDDDDD"
s3 = "CCCCAAABBDDABBB"
s4 = "AAAAAAAA"
z = "AAABBCCCCDE"
Calling rem() on each of them in turn prints:
AABBCCDDEE
AABBDD
CCAABBDDABB
AA
AABBCCDE

Finding a sequence of characters in string

Using python, I am trying to find any sequence of characters in a string by specifying the length of this chain of characters.
For example, given the following variable, I want to extract every run of one identical character that has a length of 5:
x = "jhg**11111**jjhgj**11111**klhhkjh111ljhjkh1111"
The result should be:
11111
11111
How can I do that?
itertools to the rescue :)
>>> import itertools
>>> val = 5
>>> x
'jhg**11111**jjhgj**11111**klhhkjh111ljhjkh1111'
>>> [y[0]*val for y in itertools.groupby(x) if len(list(y[1])) == val]
['11111', '11111']
Edit: with better names
>>> [char*val for char,grouper in itertools.groupby(x) if len(list(grouper)) == val]
['11111', '11111']
Or the more memory-efficient one-liner suggested by @Chris_Rands
>>> [k*val for k, g in itertools.groupby(x) if sum(1 for _ in g) == val]
Or, if you are fine with using regex, it makes your code a lot cleaner (this needs import re, and note that {4,} matches runs of five or more of the same character):
[row[0] for row in re.findall(r'((.)\2{4,})', x)]
The original answer (below) is for a different problem (identifying repeated patterns of n characters in the string). Here is one possible one liner to solve the problem:
x = "jhg**11111**jjhgj**11111**klhhkjh111ljhjkh1111"
n = 5
res = [x[i:i + n] for i, c in enumerate(x) if x[i:i + n] == c * n]
print(res)
# ['11111', '11111']
Original (wrong) answer
Using Counter:
from collections import Counter
x = "jhg**11111**jjhgj**11111**klhhkjh111ljhjkh1111"
n = 5
c = Counter(x[i:i + n] for i in range(len(x) - n + 1))
for k, v in c.items():
    if v > 1:
        print(*([k] * v), sep='\n')
Output:
**111
**111
*1111
*1111
11111
11111
1111*
1111*
111**
111**
Very ugly solution :-)
x = "jhg**11111**jjhgj**11111**klhhkjh22222jhjkh1111"
for c, i in enumerate(x):
    if i == x[c+1:c+2] and i == x[c+2:c+3] and i == x[c+3:c+4] and i == x[c+4:c+5]:
        print(x[c:c+5])
Try this:
x = "jhg**11111**jjhgj**11111**klhhkjh111ljhjkh1111"
seq_length = 5
for item in set(x):
    if seq_length*item in x:
        for i in range(x.count(seq_length*item)):
            print(seq_length*item)
It uses set() to get each distinct character, builds the sequence you're looking for from it (the character repeated seq_length times), and then searches for that sequence in the text.
It prints your desired output:
11111
11111
Let's change your source string a little:
x = "jhg**11111**jjhgj**22222**klhhkjh33333jhjkh44444"
The regex should be:
import re
pat = r'(.)\1{4}'
Here you have a capturing group (a single char) and a backreference
to it (4 times), so in total the same char must occur 5 times.
One variant to print the result, although less intuitive, is:
res = re.findall(pat, x)
print(res)
But the above code prints:
['1', '2', '3', '4']
i.e. a list where each element is only the capturing group (in our case
the first char), not the whole match.
So I also propose a second variant, using finditer and
printing both the start position and the whole match:
for match in re.finditer(pat, x):
    print('{:2d}: {}'.format(match.start(), match.group()))
For the above data the result is:
5: 11111
19: 22222
33: 33333
43: 44444
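One caveat with (.)\1{4}: inside a run longer than five characters it still matches the first five. If only runs of exactly the requested length should be reported, a sketch along these lines might help. The name exact_runs is made up for illustration, and the exact-length reading of the question is an assumption.

```python
import re

def exact_runs(text, n):
    """Return every maximal run of one character with length exactly n.

    Hypothetical helper for illustration only.
    """
    # (.)\1* matches each maximal run; keep only runs of length n.
    return [
        m.group()
        for m in re.finditer(r"(.)\1*", text, flags=re.DOTALL)
        if len(m.group()) == n
    ]

x = "jhg**11111**jjhgj**11111**klhhkjh111ljhjkh1111"
print(exact_runs(x, 5))         # ['11111', '11111']
print(exact_runs("111111", 5))  # []
```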

Loop through json array in string python

I have a string, and I created a dictionary (JSON-like) that maps strings to values:
amount = 0
a = "asaf,almog,arnon,elbar"
values_li={'asaf':'1','almog':'6','elbar':'2'}
How can I create a loop that searches a for every key in values_li and, for each key it finds, does
amount = amount + value (the value found in values_li for that key)
I tried to do this but it doesn't work:
for k,v in values_li.items():
    if k in a:
        amount = amount + v
It's working now; I figured out my problem.
v is a string and I tried to do math with a string, so I had to convert v to an int:
amount = amount + int(v)
Now it's working :)
You should be careful using:
if k in a:
a is the string "asaf,almog,arnon,elbar", not a list. This means that:
"bar" in a # == True
"as" in a # == True
and so on, which is probably not what you want.
You should consider splitting it into a list, so that you only get complete matches. With that you can simply use:
a = "asaf,almog,arnon,elbar".split(',')
values_li={'asaf':'1','almog':'6','elbar':'2'}
amount = sum([int(values_li[k]) for k in a if k in values_li])
# 9
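To make the substring pitfall concrete, here is a small sketch that compares the two checks. The function name total_for and the extra 'bar' entry are made up for illustration; they are not part of the question's data.

```python
def total_for(names_csv, values):
    """Sum the values of the names that appear as whole comma-separated items.

    Hypothetical helper for illustration only.
    """
    names = names_csv.split(",")
    return sum(int(values[name]) for name in names if name in values)

a = "asaf,almog,arnon,elbar"
# The 'bar' key is a hypothetical addition to trigger a false substring match.
values_li = {"asaf": "1", "almog": "6", "elbar": "2", "bar": "100"}

print(total_for(a, values_li))  # 9

# Substring test on the raw string: 'bar' matches inside 'elbar'.
substring_total = sum(int(v) for k, v in values_li.items() if k in a)
print(substring_total)  # 109
```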
collections.Counter() is your friend:
from collections import Counter
a = "asaf,almog,arnon,elbar"
values_li = Counter({'asaf':1,'almog':6,'elbar':2})
values_li.update(a.split(','))
values_li
That will result in:
Counter({'almog': 7, 'elbar': 3, 'asaf': 2, 'arnon': 1})
And if you want the sum of all values in values_li, you can simply do:
sum(values_li.values())
Which will result in 13, for the key/value pairs in your example.

Compare strings in list in Python and output character until they are identical

How can I compare all strings in a list e.g:
"A-B-C-D-E-F-H-A",
"A-B-C-F-G-H-M-P",
and output the character position up to which they are identical?
In the example above it would be:
Character 6
And output the most similar strings.
I tried with collections.Counter but that did not work.
You're trying to go character by character in the two strings in lockstep. This is a job for zip:
A = "A-B-C-D-E-F-H-A"
B = "A-B-C-F-G-H-M-P"
count = 0
for a, b in zip(A, B):
    if a == b:
        count += 1
    else:
        break
Or, if you prefer to think of it as "count as long as they are equal", this is a job for takewhile:
from itertools import takewhile
from operator import eq
def ilen(iterable): return sum(1 for _ in iterable)
count = ilen(takewhile(lambda ab: eq(*ab), zip(A, B)))
If you have a list of these strings, and you want to compare every string to every other string:
First, you turn the above code into a function. I'll do it with the itertools version, but you can do it with the other just as easily:
def shared_prefix(A, B):
    return ilen(takewhile(lambda ab: eq(*ab), zip(A, B)))
Now, for every string, you compare it to all the rest of the strings. There's an easy way to do it with combinations:
from itertools import combinations
counts = [shared_prefix(*pair) for pair in combinations(list_o_strings, 2)]
But if you don't understand that, you can write it as a nested loop. The only tricky part is what "the rest of the strings" means. You can't loop over all the strings in both the outer and inner loops, or you'll compare each pair of strings twice (once in each order), and compare each string to itself. So it has to mean "all the strings after the current one". Like this:
counts = []
for i, s1 in enumerate(list_o_strings):
    for s2 in list_o_strings[i+1:]:
        counts.append(shared_prefix(s1, s2))
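The question also asks for the most similar strings. Putting the pieces together, a self-contained sketch could look like this, assuming "most similar" means the pair with the longest shared prefix. The name most_similar and the third sample string are made up for illustration.

```python
from itertools import combinations, takewhile

def shared_prefix(a, b):
    """Number of leading characters that a and b have in common."""
    return sum(1 for _ in takewhile(lambda pair: pair[0] == pair[1], zip(a, b)))

def most_similar(strings):
    """Return the pair with the longest shared prefix (hypothetical helper)."""
    return max(combinations(strings, 2), key=lambda pair: shared_prefix(*pair))

list_o_strings = [
    "A-B-C-D-E-F-H-A",
    "A-B-C-F-G-H-M-P",
    "A-B-X",  # hypothetical extra string
]
print(shared_prefix(list_o_strings[0], list_o_strings[1]))  # 6
print(most_similar(list_o_strings))
# ('A-B-C-D-E-F-H-A', 'A-B-C-F-G-H-M-P')
```

If several pairs tie, max returns the first one it encounters, which may or may not be what you want.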
I think this code will solve your problem.
listA = "A-B-C-D-E-F-H-A"
listB = "A-B-C-F-G-H-M-P"
newListA = listA.replace ("-", "")
newListB = listB.replace ("-", "")
# newListA = "ABCDEFHA"
# newListB = "ABCFGHMP"
i = 0
exit = 0
# Stop at the first mismatch or at the end of the shorter string.
while i < min(len(newListA), len(newListB)) and exit == 0:
    if newListA[i] != newListB[i]:
        exit = 1
    i = i + 1
print("Character: " + str(i))
