Finding a sequence of characters in string - python

Using Python, I am trying to find any sequence of identical characters in a string by specifying the length of the sequence.
For example, if we have the following variable, I want to extract any identical sequence of characters with a length of 5:
x = "jhg**11111**jjhgj**11111**klhhkjh111ljhjkh1111"
the result should be:
11111
11111
How can I do that?

itertools to the rescue :)
>>> import itertools
>>> val = 5
>>> x
'jhg**11111**jjhgj**11111**klhhkjh111ljhjkh1111'
>>> [y[0]*val for y in itertools.groupby(x) if len(list(y[1])) == val]
['11111', '11111']
Edit: with better names
>>> [char*val for char,grouper in itertools.groupby(x) if len(list(grouper)) == val]
['11111', '11111']
Or the more memory-efficient one-liner suggested by @Chris_Rands:
>>> [k*val for k, g in itertools.groupby(x) if sum(1 for _ in g) == val]
['11111', '11111']

Or, if you are fine with using regex, it makes your code a lot cleaner:
import re
[row[0] for row in re.findall(r'((.)\2{4,})', x)]
Note that {4,} also matches runs longer than five characters.

The original answer (below) is for a different problem (identifying repeated patterns of n characters in the string). Here is one possible one liner to solve the problem:
x = "jhg**11111**jjhgj**11111**klhhkjh111ljhjkh1111"
n = 5
res = [x[i:i + n] for i, c in enumerate(x) if x[i:i + n] == c * n]
print(res)
# ['11111', '11111']
Original (wrong) answer
Using Counter:
from collections import Counter
x = "jhg**11111**jjhgj**11111**klhhkjh111ljhjkh1111"
n = 5
c = Counter(x[i:i + n] for i in range(len(x) - n + 1))
for k, v in c.items():
    if v > 1:
        print(*([k] * v), sep='\n')
Output:
**111
**111
*1111
*1111
11111
11111
1111*
1111*
111**
111**

Very ugly solution :-)
x = "jhg**11111**jjhgj**11111**klhhkjh22222jhjkh1111"
for c, i in enumerate(x):
    if i == x[c+1:c+2] and i == x[c+2:c+3] and i == x[c+3:c+4] and i == x[c+4:c+5]:
        print(x[c:c+5])

Try this:
x = "jhg**11111**jjhgj**11111**klhhkjh111ljhjkh1111"
seq_length = 5
for item in set(x):
    if seq_length*item in x:
        for i in range(x.count(seq_length*item)):
            print(seq_length*item)
It works by using set() to get each distinct character, building the sequence you're looking for from it, and then searching for that sequence in the text.
It prints your desired output:
11111
11111

Let's change your source string a little:
x = "jhg**11111**jjhgj**22222**klhhkjh33333jhjkh44444"
The regex should be:
pat = r'(.)\1{4}'
Here you have a capturing group (a single char) and a backreference
to it (4 times), so in total the same char must occur 5 times.
One variant to print the result, although less intuitive, is:
import re
res = re.findall(pat, x)
print(res)
But the above code prints:
['1', '2', '3', '4']
i.e. a list, where each position is only the capturing group (in our case
the first char), not the whole match.
So I propose also the second variant, with finditer and
printing both start position and the whole match:
for match in re.finditer(pat, x):
    print('{:2d}: {}'.format(match.start(), match.group()))
For the above data the result is:
 5: 11111
19: 22222
33: 33333
43: 44444
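The answers above differ in what they report once a run is longer than five characters. Here is a minimal sketch of the three behaviours side by side. The function names (exact_runs, long_runs, windows) are made up for illustration and are not from any of the answers:

```python
import re
from itertools import groupby

def exact_runs(text, n):
    """Runs whose length is exactly n (the groupby idea)."""
    return [char * n for char, group in groupby(text)
            if sum(1 for _ in group) == n]

def long_runs(text, n):
    """Whole runs of length n or more (the backreference regex idea)."""
    pattern = r"(.)\1{%d,}" % (n - 1)
    return [m.group() for m in re.finditer(pattern, text)]

def windows(text, n):
    """Every n-character window made of one repeated character."""
    return [text[i:i + n] for i, c in enumerate(text)
            if text[i:i + n] == c * n]

sample = "ab111111cd22222"    # six 1s, then five 2s
print(exact_runs(sample, 5))  # ['22222']
print(long_runs(sample, 5))   # ['111111', '22222']
print(windows(sample, 5))     # ['11111', '11111', '22222']
```

On the string from the question all three agree, because every run there is at most five characters long. Which one is right depends on how a longer run should be treated.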

Related

How to replace all occurrences of "00000" with "0" repeatedly?

I need to repeatedly replace all occurrence of 00000 with 0 in a binary string input.
Although I'm able to achieve it to some extent, I do not know the logic when there are multiple consecutive 00000s like for example:
25 0s should be replaced with one 0
50 0s should be replaced with two 0s
125 0s should be replaced with one 0
Currently I have the following code:
new_list = []
c = 0
l = list(s.split("00000"))
print(l)
for i in l:
    if i == "00000":
        for x in range(l.index(i),l.index(i-3)):
            if l[x] != 0:
                break
            for y in range(0,5):
                del l[i-y]
    new_list.append(i)
    new_list.append("0")
r_list = new_list[0:-1]
r_list= ''.join(map(str, r_list))
print(r_list)
But this will not work for 25 0s.
Also, what would be the regex alternative for this?
To get those results, you would need to repeatedly replace five consecutive zeroes to one zero, until there is no more occurrence of five consecutive zeroes. Here is an example run:
s = "0" * 125 # example input
while "00000" in s:
    s = s.replace("00000", "0")
print(s)
As I state in my comment, my best guess at what you're trying to do is that you're trying to repeatedly apply the rule that five 0's get replaced with one 0, so that, for example, 25 0's get reduced to 00000, which in turn gets reduced to 0. Assuming that's correct:
It's not the most efficient approach, but here's one way to do it:
import re
new = "00000100002000003000000004" + "0"*50
old = ""
while old != new:
    old,new = new,re.sub("0{5}","0",new)
print(new) #0100002030000400
Alternatively, here's a method to apply that change in one pass through the string:
s = "00000100002000003000000004" + "0"*50
stack,ct = ['#'],[-1]
i = 0
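while i < len(s):
    if s[i] == stack[-1]:
        # same character as the current run: extend it
        ct[-1] += 1
        i+=1
    elif ct[-1] >= 5:
        # the run ended with 5+ repeats: every 5 collapse into 1, the remainder stays
        q,r = divmod(ct[-1],5)
        ct[-1] = q+r
    else:
        # start a new run
        stack.append(s[i])
        ct.append(1)
        i+=1
# reduce the final run as well
while ct[-1] >= 5:
    q,r = divmod(ct[-1],5)
    ct[-1] = q+r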
ans = "".join(c*k for c,k in zip(stack[1:],ct[1:]))
print(ans)
PyPI regex supports recursion. Something like this could do:
import regex as re
s = re.sub(r"0000(?:(?0)|0)", "0", s)
At (?0) or alternatively (?R) the pattern gets pasted (recursed).
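If installing the third-party regex module is not an option, the recursion does not seem strictly necessary. The recursive pattern matches 0000 repeated one or more times followed by a single 0, i.e. runs of 4k+1 zeros, and the standard re module can express that as (?:0000)+0. A minimal sketch, with made-up function names, checked against the plain replace loop:

```python
import re

def collapse_loop(s):
    """Reference behaviour: replace until no '00000' is left."""
    while "00000" in s:
        s = s.replace("00000", "0")
    return s

def collapse_once(s):
    """Single pass: a run of 4k+1 zeros (k >= 1) becomes one zero."""
    return re.sub(r"(?:0000)+0", "0", s)

# Both versions agree on runs of various lengths
for n in (4, 5, 8, 25, 50, 125):
    assert collapse_once("0" * n) == collapse_loop("0" * n)

print(collapse_once("00000100002000003000000004" + "0" * 50))
# 0100002030000400
```

This works because each replacement removes exactly four zeros, so a run of n zeros always ends up with 1 + (n - 1) % 4 zeros, whichever way the replacements are grouped.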

Python - removing repeated letters in a string

Say I have a string in alphabetical order, based on the amount of times that a letter repeats.
Example: "BBBAADDC".
There are 3 B's, so they go at the start, 2 A's and 2 D's, so the A's go in front of the D's because they are in alphabetical order, and 1 C. Another example would be CCCCAAABBDDAB.
Note that there can be 4 letters in the middle somewhere (i.e. CCCC), as there could be 2 pairs of 2 letters.
However, let's say I can only have n letters in a row. For example, if n = 3 in the second example, then I would have to omit one "C" from the first substring of 4 C's, because there can only be a maximum of 3 of the same letters in a row.
Another example would be the string "CCCDDDAABC"; if n = 2, I would have to remove one C and one D to get the string CCDDAABC
Example input/output:
n=2: Input: AAABBCCCCDE, Output: AABBCCDE
n=4: Input: EEEEEFFFFGGG, Output: EEEEFFFFGGG
n=1: Input: XXYYZZ, Output: XYZ
How can I do this with Python? Thanks in advance!
This is what I have right now, although I'm not sure if it's on the right track. Here, z is the length of the string.
for k in range(z+1):
    if final_string[k] == final_string[k+1] == final_string[k+2] == final_string[k+3]:
        final_string = final_string.translate({ord(final_string[k]): None})
return final_string
Ok, based on your comment, you're either pre-sorting the string or it doesn't need to be sorted by the function you're trying to create. You can do this more easily with itertools.groupby():
import itertools
def max_seq(text, n=1):
    result = []
    for k, g in itertools.groupby(text):
        result.extend(list(g)[:n])
    return ''.join(result)
max_seq('AAABBCCCCDE', 2)
# 'AABBCCDE'
max_seq('EEEEEFFFFGGG', 4)
# 'EEEEFFFFGGG'
max_seq('XXYYZZ')
# 'XYZ'
max_seq('CCCDDDAABC', 2)
# 'CCDDAABC'
Each group g is expanded into a list and then sliced to at most n elements (the [:n] part), so you get each letter at most n times in a row. If the same letter appears elsewhere, it's treated as an independent sequence when counting n in a row.
Edit: Here's a shorter version, which may also perform better for very long strings. And while we're using itertools, this one additionally utilises itertools.chain.from_iterable() to create the flattened list of letters. And since each of these is a generator, it's only evaluated/expanded at the last line:
import itertools
def max_seq(text, n=1):
    sequences = (list(g)[:n] for _, g in itertools.groupby(text))
    letters = itertools.chain.from_iterable(sequences)
    return ''.join(letters)
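If the runs themselves can be very long, list(g)[:n] still copies a whole run before slicing it. A possible variation, sketched here under the hypothetical name max_seq_lazy, uses itertools.islice so that at most n items of each group are ever read:

```python
from itertools import chain, groupby, islice

def max_seq_lazy(text, n=1):
    # islice stops after n items, so a long run is never copied into a list
    capped = (islice(group, n) for _, group in groupby(text))
    return "".join(chain.from_iterable(capped))

print(max_seq_lazy('AAABBCCCCDE', 2))   # AABBCCDE
print(max_seq_lazy('EEEEEFFFFGGG', 4))  # EEEEFFFFGGG
print(max_seq_lazy('XXYYZZ'))           # XYZ
```

groupby skips the unread rest of a group by itself when the next group is requested, which is why the leftover characters never show up in the result.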
hello = "hello frrriend"
def replacing() -> str:
    global hello
    j = 0
    for i in hello:
        if j == 0:
            pass
        else:
            if i == prev:
                hello = hello.replace(i, "")
                prev = i
        prev = i
        j += 1
    return hello
replacing()
It looks a bit primitive, but I think it works; that's what I came up with on the go anyway. Hope it helps :D (Note: str.replace removes every occurrence of the repeated letter, so this returns 'heo fiend' rather than capping runs at n.)
Here's my solution:
def snip_string(string, n):
    list_string = list(string)
    list_string.sort()
    chars = set(string)
    for char in chars:
        while list_string.count(char) > n:
            list_string.remove(char)
    return ''.join(list_string)
Calling the function with various values for n gives the following output:
>>> string = "AAAABBBCCCDDD"
>>> snip_string(string, 1)
'ABCD'
>>> snip_string(string, 2)
'AABBCCDD'
>>> snip_string(string, 3)
'AAABBBCCCDDD'
>>>
Edit
Here is the updated version of my solution, which only removes characters if the group of repeated characters exceeds n.
import itertools
def snip_string(string, n):
    groups = [list(g) for k, g in itertools.groupby(string)]
    string_list = []
    for group in groups:
        while len(group) > n:
            del group[-1]
        string_list.extend(group)
    return ''.join(string_list)
Output:
>>> string = "DDDAABBBBCCABCDE"
>>> snip_string(string, 3)
'DDDAABBBCCABCDE'
from itertools import groupby
n = 2
def rem(string):
    out = "".join(["".join(list(g)[:n]) for _, g in groupby(string)])
    print(out)
So this is the entire code for your question, tested with the following strings:
s = "AABBCCDDEEE"
s2 = "AAAABBBDDDDDDD"
s3 = "CCCCAAABBDDABBB"
s4 = "AAAAAAAA"
z = "AAABBCCCCDE"
Calling rem() on each of them prints:
AABBCCDDEE
AABBDD
CCAABBDDABB
AA
AABBCCDE
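For comparison, the same capping can probably be done with a single re.sub call: match a character followed by n or more copies of itself and put back only n of them. A minimal sketch, where cap_runs is a made-up name and n is assumed to be at least 1:

```python
import re

def cap_runs(text, n):
    """Shorten every run of identical characters to at most n (n >= 1)."""
    # one character followed by n or more copies of it, i.e. a run longer than n
    pattern = r"(.)\1{%d,}" % n
    return re.sub(pattern, lambda m: m.group(1) * n, text, flags=re.DOTALL)

print(cap_runs('AAABBCCCCDE', 2))   # AABBCCDE
print(cap_runs('EEEEEFFFFGGG', 4))  # EEEEFFFFGGG
print(cap_runs('XXYYZZ', 1))        # XYZ
```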

Remove N consecutive repeated characters in a string

I am trying to solve a problem where the user inputs a string, say str = "aaabbcc", and an integer n = 2.
The function is supposed to remove characters that appear 'n' times in a row from the str and output only "aaa".
I tried a couple of approaches and I'm not able to obtain the right output.
Are there any regular expression functions that I could use, or any recursive functions, or just plain old iteration?
Thanks in advance.
Using itertools.groupby
Ex:
from itertools import groupby
s = "aaabbcc"
n = 2
result = ""
for k, v in groupby(s):
    value = list(v)
    if not len(value) == n:
        result += "".join(value)
print(result)
Output:
aaa
You can use itertools.groupby:
>>> s = "aaabbccddddddddddeeeee"
>>> from itertools import groupby
>>> n = 3
>>> groups = (list(values) for _, values in groupby(s))
>>> "".join("".join(v) for v in groups if len(v) < n)
'bbcc'
from collections import Counter
counts = Counter(string)
string = "".join(c for c in string if counts[c] != 2)
Edit: Wait, sorry, I missed "consecutive". This will remove characters that occur exactly two times in the whole string (fitting your example, but not the general case).
Consecutive filter is a bit more complex, but doable - just find the consecutive runs first, then filter out the ones which have length two.
runs = [[string[0], 0]]
for c in string:
    if c == runs[-1][0]:
        runs[-1][1] += 1
    else:
        runs.append([c, 1])
string = "".join(c*length for c,length in runs if length != 2)
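Edit2: As the other answers correctly point out, the first part of this is done natively by groupby. Note that groupby yields (key, group iterator) pairs, so each group's length has to be computed first:
from itertools import groupby
runs = ((c, len(list(g))) for c, g in groupby(string))
string = "".join(c*length for c, length in runs if length != 2)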
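The difference between counting over the whole string and counting consecutive runs only shows up when a letter occurs in more than one place. A small sketch of both readings, using made-up function names (drop_by_total, drop_by_run):

```python
from collections import Counter
from itertools import groupby

def drop_by_total(text, n):
    """Drop characters whose total count in the string is exactly n."""
    counts = Counter(text)
    return "".join(c for c in text if counts[c] != n)

def drop_by_run(text, n):
    """Drop runs of consecutive identical characters of length exactly n."""
    runs = ("".join(group) for _, group in groupby(text))
    return "".join(run for run in runs if len(run) != n)

print(drop_by_total("aaabbcc", 2), drop_by_run("aaabbcc", 2))  # aaa aaa
print(drop_by_total("aabcaa", 2), drop_by_run("aabcaa", 2))    # aabcaa bc
```

On the example from the question the two agree, which is why the Counter version looks correct at first.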
In [15]: some_string = 'aaabbcc'
In [16]: n = 2
In [17]: final_string = ''
In [18]: for k, v in Counter(some_string).items():
    ...:     if v != n:
    ...:         final_string += k * v
    ...:
In [19]: final_string
Out[19]: 'aaa'
You'll need: from collections import Counter
from collections import defaultdict
def fun(string,n):
    dic = defaultdict(int)
    for i in string:
        dic[i]+=1
    check = []
    for i in dic:
        if dic[i]==n:
            check.append(i)
    for i in check:
        del dic[i]
    return dic
string = "aaabbcc"
n = 2
result = fun(string, n)
sol =''
for i in result:
    sol+=i*result[i]
print(sol)
Output:
aaa

Splitting an unspaced string of decimal values - Python

An awful person has given me a string like this
values = '.850000.900000.9500001.000001.50000'
and I need to split it to create the following list:
['.850000', '.900000', '.950000', '1.00000', '1.50000']
I know that if I were dealing only with numbers < 1, I could use the code
dl = '.'
splitvalues = [dl+e for e in values.split(dl) if e != ""]
But in cases like this one, where there are numbers greater than 1 buried in the string, splitvalues would end up being
['.850000', '.900000', '.9500001', '.000001', '.50000']
So is there a way to split a string with multiple delimiters while also splitting the string differently based on which delimiter is encountered?
I think this is somewhat closer to a fixed width format string. Try a regular expression like this:
import re
pattern = r"(\d{1,2}\.\d{5})"
m = re.search(pattern, input_str)
your_first_number = m.group(0)
Try this repeatedly on the remaining string to consume all numbers.
>>> import re
>>> source = '0.850000.900000.9500001.000001.50000'
>>> re.findall(r"(.*?00+(?!0))", source)
['0.850000', '.900000', '.950000', '1.00000', '1.50000']
The split is based on looking for {anything, a double zero, a run of zeros (followed by a non-zero)}.
Assuming that the value before the decimal point is less than 10, we have:
values = '0.850000.900000.9500001.000001.50000'
result = list()
last_digit = None
for value in values.split('.'):
    if value.endswith('0'):
        result.append(''.join([i for i in [last_digit, '.', value] if i]))
        last_digit = None
    else:
        result.append(''.join([i for i in [last_digit, '.', value[0:-1]] if i]))
        last_digit = value[-1]
if values.startswith('0'):
    result = result[1:]
print(result)
# Output
['.850000', '.900000', '.950000', '1.00000', '1.50000']
How about using re.split():
import re
values = '0.850000.900000.9500001.000001.50000'
print([a + b for a, b in zip(*(lambda x: (x[1::2], x[2::2]))(re.split(r"(\d\.)", values)))])
OUTPUT
['0.85000', '0.90000', '0.950000', '1.00000', '1.50000']
Here each value has a fixed width: 6 digits, or 7 characters including the dot. Take the slices from 0 to 7, 7 to 14, and so on. Because we don't need the initial zero, I use the slice values[1:] for extraction.
values = '0.850000.900000.9500001.000001.50000'
[values[1:][start:start+7] for start in range(0,len(values[1:]),7)]
['.850000', '.900000', '.950000', '1.00000', '1.50000']
Test:
''.join([values[1:][start:start+7] for start in range(0,len(values[1:]),7)]) == values[1:]
True
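The fixed-width idea can be wrapped in a small helper. This is a sketch that assumes every field really is 7 characters wide, and split_fixed_width is a made-up name. It uses the string from the question, which has no leading zero, so no values[1:] is needed:

```python
def split_fixed_width(text, width):
    """Cut text into consecutive fields of `width` characters."""
    return [text[i:i + width] for i in range(0, len(text), width)]

values = '.850000.900000.9500001.000001.50000'
fields = split_fixed_width(values, 7)
print(fields)
# ['.850000', '.900000', '.950000', '1.00000', '1.50000']

# float() accepts a leading dot, so the fields convert directly
print([float(f) for f in fields])
# [0.85, 0.9, 0.95, 1.0, 1.5]
```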
With a fixed / variable string, you may try something like:
values = '0.850000.900000.9500001.000001.50000'
str_list = []
first_index = values.find('.')
while first_index > 0:
    last_index = values.find('.', first_index + 1)
    if last_index != -1:
        str_list.append(values[first_index - 1: last_index - 2])
        first_index = last_index
    else:
        str_list.append(values[first_index - 1: len(values) - 1])
        break
print(str_list)
Output:
['0.8500', '0.9000', '0.95000', '1.0000', '1.5000']
Assuming that there will always be a single digit before the decimal.
Please take this as a starting point and not a copy paste solution.

Overlapping count of substring in a string in Python

I want to find all the counts (overlapping and non-overlapping) of a sub-string in a string.
I found two answers: one uses regex, which is not what I want, and the other is much less efficient than I need.
I need something like:
'ababaa'.count('aba') == 2
str.count() just counts simple substrings. What should I do?
def sliding(a, n):
    return (a[i:i+n] for i in range(len(a) - n + 1))
def substring_count(a, b):
    return sum(s == b for s in sliding(a, len(b)))
assert list(sliding('abcde', 3)) == ['abc', 'bcd', 'cde']
assert substring_count('ababaa', 'aba') == 2
count = len(set([string.find('aba',x) for x in range(len(string)) if string.find('aba',x) >= 0]))
Does this do the trick?
def count(string, substring):
    n = len(substring)
    cnt = 0
    # + 1 so that a match at the very end of the string is also checked
    for i in range(len(string) - n + 1):
        if string[i:i+n] == substring:
            cnt += 1
    return cnt
print(count('ababaa', 'aba')) # 2
I don't know if there's a more efficient solution, but this should work.
Here, using re.finditer() is the best way to achieve what you want.
import re
def get_substring_count(s, sub_s):
    # re.escape keeps regex metacharacters in sub_s from being interpreted
    return sum(1 for m in re.finditer('(?=%s)' % re.escape(sub_s), s))
get_substring_count('ababaa', 'aba')
# 2 as response
Here's a function you could use:
def count(haystack, needle):
    return len([x for x in [haystack[i:j+1] for i in range(len(haystack)) for j in range(i,len(haystack))] if x == needle])
Then:
>>> count("ababaa", "aba")
2
A brute-force approach is just
n = len(needle)
count = sum(haystack[i:i+n] == needle for i in range(len(haystack)-n+1))
(this works because in Python True and False are equivalent to numbers 1 and 0 for most uses, including math).
Using a regexp instead it could be
count = len(re.findall(re.escape(needle[:1]) + "(?=" + re.escape(needle[1:]) + ")", haystack))
(i.e. using a(?=ba) instead of aba to find overlapping matches too)
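Another option that avoids both regex and slicing at every position is to let str.find jump from hit to hit. A minimal sketch, where count_overlapping is a made-up name and treating an empty needle as zero matches is an assumption:

```python
def count_overlapping(haystack, needle):
    """Count occurrences of needle, letting matches overlap."""
    if not needle:
        return 0  # assumption: an empty needle counts as no match
    count = 0
    start = haystack.find(needle)
    while start != -1:
        count += 1
        # resume one character after the last hit so overlaps are found
        start = haystack.find(needle, start + 1)
    return count

print(count_overlapping('ababaa', 'aba'))  # 2
print(count_overlapping('aaaa', 'aa'))     # 3
```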
Looping through sliced string
def count_substring(string, sub_string):
    l = len(sub_string)
    n = len(string)
    count = sum(1 for i in range(n-l+1) if string[i:i+l].count(sub_string)>0 )
    return count
Another way to consider is to leverage the Counter container. While the accepted answer is fastest for shorter strings, if you are searching for relatively short substrings within long strings, the Counter approach starts to have the edge. Also, if you need to refactor this to perform multiple substring count queries against the same main string, then the Counter approach starts looking much more attractive.
For example, searching for a substring of length = 3 gave me the following results using timeit:
Main string length   Accepted answer   Counter approach
6 characters         4.1us             7.4us
50 characters        24.4us            25us
150 characters       70.7us            64.9us
1500 characters      723us             614us
from collections import Counter
def count_w_overlap(search_string, main_string):
    # Split up main_string into all possible overlap possibilities
    search_len = len(search_string)
    candidates = [main_string[i:i+search_len] for i in range(0, len(main_string) - search_len + 1)]
    # Create the Counter container
    freq_count = Counter(candidates)
    return freq_count[search_string]
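For the multiple-query case mentioned above, the Counter only needs to be built once per substring length and can then be queried repeatedly. A sketch of that refactoring, with a made-up helper name (build_window_counts):

```python
from collections import Counter

def build_window_counts(main_string, width):
    """Count every substring of the given width once, up front."""
    return Counter(main_string[i:i + width]
                   for i in range(len(main_string) - width + 1))

counts = build_window_counts('ababaa', 3)
print(counts['aba'])  # 2
print(counts['baa'])  # 1
print(counts['xyz'])  # 0 (a Counter returns 0 for missing keys)
```

The limitation is that one Counter only answers queries for a single substring length, so queries of different lengths each need their own Counter.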
