Overlapping count of substring in a string in Python - python

I want to count every occurrence of a substring in a string, including overlapping ones.
I found two answers: one uses regex, which I would rather avoid, and the other is much less efficient than I need.
I need something like:
'ababaa'.count('aba') == 2
str.count() just counts simple substrings. What should I do?

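def sliding(a, n):
    # Yield every window of length n, moving one character at a time.
    return (a[i:i+n] for i in range(len(a) - n + 1))
def substring_count(a, b):
    # True counts as 1, so sum() tallies the matching windows.
    return sum(s == b for s in sliding(a, len(b)))
assert list(sliding('abcde', 3)) == ['abc', 'bcd', 'cde']
assert substring_count('ababaa', 'aba') == 2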

count = len(set([string.find('aba',x) for x in range(len(string)) if string.find('aba',x) >= 0]))

Does this do the trick?
def count(string, substring):
    n = len(substring)
    cnt = 0
    # + 1 so the window starting at the last valid index is checked too
    for i in range(len(string) - n + 1):
        if string[i:i+n] == substring:
            cnt += 1
    return cnt
print(count('ababaa', 'aba'))  # 2
I don't know if there's a more efficient solution, but this should work.

Here, re.finditer() with a lookahead pattern is a good way to achieve what you want.
import re
def get_substring_count(s, sub_s):
    # re.escape keeps metacharacters in sub_s from being read as regex syntax
    return sum(1 for m in re.finditer('(?=%s)' % re.escape(sub_s), s))
get_substring_count('ababaa', 'aba')
# 2 as response

Here's a function you could use:
def count(haystack, needle):
    # Build every substring haystack[i:j+1], then keep those equal to needle
    return len([x for x in [haystack[i:j+1] for i in range(len(haystack)) for j in range(i, len(haystack))] if x == needle])
>>> count("ababaa", "aba")
2

A brute-force approach is just
n = len(needle)
count = sum(haystack[i:i+n] == needle for i in range(len(haystack)-n+1))
(this works because in Python True and False are equivalent to the numbers 1 and 0 for most uses, including arithmetic).
Using a regexp instead, it could be
count = len(re.findall(re.escape(needle[:1])+"(?="+re.escape(needle[1:])+")", haystack))
(i.e. using a(?=ba) instead of aba to find overlapping matches too)
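To make the lookahead idea concrete, here is a minimal sketch that wraps it in a function. The name count_overlapping is just an illustration (it is not from the answer above), and the whole needle is placed inside the lookahead rather than only its tail, which gives the same count.

```python
import re

def count_overlapping(haystack, needle):
    """Count occurrences of needle in haystack, overlaps included."""
    if not needle:
        raise ValueError("needle must be non-empty")
    # A zero-width lookahead matches at a start position without consuming
    # any characters, so the scan can match again one character later.
    pattern = "(?=" + re.escape(needle) + ")"
    return sum(1 for _ in re.finditer(pattern, haystack))

print(count_overlapping("ababaa", "aba"))  # 2
print(count_overlapping("1+1+1", "1+1"))   # 2, because re.escape treats + literally
```

Without re.escape, a needle such as "1+1" would be read as regex syntax and the count would be wrong.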

Looping through sliced string
def count_substring(string, sub_string):
    l = len(sub_string)
    n = len(string)
    count = sum(1 for i in range(n-l+1) if string[i:i+l].count(sub_string)>0 )
    return count

Another option is to leverage the Counter container. While the accepted answer is fastest for shorter strings, the Counter approach starts to pull ahead when you are searching for relatively short substrings within long strings. Also, if you need to run multiple substring count queries against the same main string, the Counter approach becomes much more attractive.
For example, searching for a substring of length 3 gave me the following results using timeit:
Main string length / Accepted Answer / Counter Approach
6 characters / 4.1us / 7.4us
50 characters / 24.4us / 25us
150 characters / 70.7us / 64.9us
1500 characters / 723us / 614us
from collections import Counter
def count_w_overlap(search_string, main_string):
    # Split main_string into every window of the search length (overlaps included)
    search_len = len(search_string)
    candidates = [main_string[i:i+search_len] for i in range(0, len(main_string) - search_len + 1)]
    # Create the Counter container
    freq_count = Counter(candidates)
    return freq_count[search_string]
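For the multiple-query case mentioned above, one possible refactoring is to build the Counter once and look up as many substrings as needed afterwards. This is only a sketch; build_window_counts is a hypothetical helper name, and it assumes all queries share the same length.

```python
from collections import Counter

def build_window_counts(main_string, width):
    """Count every substring of the given width, overlaps included."""
    return Counter(main_string[i:i + width]
                   for i in range(len(main_string) - width + 1))

# Build once, then each query is a dictionary lookup.
counts = build_window_counts("ababaa", 3)
print(counts["aba"])  # 2
print(counts["baa"])  # 1
print(counts["xyz"])  # 0, a Counter returns 0 for missing keys
```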


How to replace all occurrences of "00000" with "0" repeatedly?

I need to repeatedly replace all occurrences of 00000 with 0 in a binary string input.
Although I'm able to achieve it to some extent, I do not know the logic to use when there are multiple consecutive 00000s, for example:
25 0s should be replaced with one 0
50 0s should be replaced with two 0s
125 0s should be replaced with one 0
Currently I have the following code:
new_list = []
c = 0
l = list(s.split("00000"))
for i in l:
    if i == "00000":
        for x in range(l.index(i),l.index(i-3)):
            if l[x] != 0:
                for y in range(0,5):
                    del l[i-y]
r_list = new_list[0:-1]
r_list= ''.join(map(str, r_list))
But this will not work for 25 0s.
Also, what would be the regex alternative for this?
To get those results, you would need to repeatedly replace five consecutive zeroes with one zero until no run of five consecutive zeroes remains. Here is an example run:
s = "0" * 125 # example input
while "00000" in s:
    s = s.replace("00000", "0")
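As a quick check against the three cases in the question, the same loop can be wrapped in a function. This is a small sketch; collapse_zeros is a name made up for illustration.

```python
def collapse_zeros(s):
    """Replace '00000' with '0' until no run of five zeros remains."""
    while "00000" in s:
        s = s.replace("00000", "0")
    return s

for n in (25, 50, 125):
    print(n, "->", collapse_zeros("0" * n))
# 25 -> 0
# 50 -> 00
# 125 -> 0
```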
As I state in my comment, my best guess at what you're trying to do is that you're trying to repeatedly apply the rule that five 0's get replaced with one 0, so that, for example, 25 0's get reduced to 00000, which in turn gets reduced to 0. Assuming that's correct:
It's not the most efficient approach, but here's one way to do it:
import re
new = "00000100002000003000000004" + "0"*50
old = ""
while old != new:
    old,new = new,re.sub("0{5}","0",new)
print(new) #0100002030000400
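Alternatively, here's a method to apply that change in one pass through the string:
s = "00000100002000003000000004" + "0"*50
stack,ct = ['#'],[-1]
i = 0
while i < len(s):
    if s[i] == stack[-1]:
        # same character as the current run: extend the run
        ct[-1] += 1
        i += 1
    elif stack[-1] == '0' and ct[-1] >= 5:
        # a run of zeros just ended: every five zeros collapse to one
        q,r = divmod(ct[-1],5)
        ct[-1] = q+r
    else:
        # start a new run
        stack.append(s[i])
        ct.append(1)
        i += 1
# collapse the final run if it is made of zeros
while stack[-1] == '0' and ct[-1] >= 5:
    q,r = divmod(ct[-1],5)
    ct[-1] = q+r
ans = "".join(c*k for c,k in zip(stack[1:],ct[1:]))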
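The divmod arithmetic used above (a run of n zeros becomes q + r zeros, where q, r = divmod(n, 5), repeated until fewer than five remain) can also be applied per run with a regex callback. This is a sketch of that idea rather than the answer's own code; reduced_length and collapse_zero_runs are hypothetical names.

```python
import re

def reduced_length(n):
    """Length of a run of n zeros after replacing 00000 -> 0 repeatedly."""
    while n >= 5:
        q, r = divmod(n, 5)  # q groups of five collapse to q zeros, r are left over
        n = q + r
    return n

def collapse_zero_runs(s):
    # Each maximal run of five or more zeros is rewritten independently.
    return re.sub(r"0{5,}", lambda m: "0" * reduced_length(len(m.group())), s)

print(collapse_zero_runs("00000100002000003000000004" + "0" * 50))
# 0100002030000400
```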
PyPI regex supports recursion. Something like this could do:
import regex as re
s = re.sub(r"0000(?:(?0)|0)", "0", s)
See this Python demo at tio.run or the regex demo at regex101
At (?0) or alternatively (?R) the pattern gets pasted (recursed).

Python - removing repeated letters in a string

Say I have a string in alphabetical order, based on the amount of times that a letter repeats.
Example: "BBBAADDC".
There are 3 B's, so they go at the start, 2 A's and 2 D's, so the A's go in front of the D's because they are in alphabetical order, and 1 C. Another example would be CCCCAAABBDDAB.
Note that there can be 4 letters in the middle somewhere (i.e. CCCC), as there could be 2 pairs of 2 letters.
However, let's say I can only have n letters in a row. For example, if n = 3 in the second example, then I would have to omit one "C" from the first substring of 4 C's, because there can only be a maximum of 3 of the same letters in a row.
Another example would be the string "CCCDDDAABC"; if n = 2, I would have to remove one C and one D to get the string CCDDAABC
Example input/output:
n=1: Input: XXYYZZ, Output: XYZ
How can I do this with Python? Thanks in advance!
This is what I have right now, although I'm not sure if it's on the right track. Here, z is the length of the string.
for k in range(z+1):
    if final_string[k] == final_string[k+1] == final_string[k+2] == final_string[k+3]:
        final_string = final_string.translate({ord(final_string[k]): None})
return final_string
Ok, based on your comment, you're either pre-sorting the string or it doesn't need to be sorted by the function you're trying to create. You can do this more easily with itertools.groupby():
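import itertools
def max_seq(text, n=1):
    result = []
    for k, g in itertools.groupby(text):
        # keep at most n letters from each run of identical letters
        result.extend(list(g)[:n])
    return ''.join(result)
max_seq('AAABBCCCCDE', 2)
# 'AABBCCDE'
max_seq('EEEEEFFFFGGG', 4)
# 'EEEEFFFFGGG'
max_seq('XXYYZZ')
# 'XYZ'
max_seq('CCCDDDAABC', 2)
# 'CCDDAABC'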
In each group g, it's expanded and then sliced until n elements (the [:n] part) so you get each letter at most n times in a row. If the same letter appears elsewhere, it's treated as an independent sequence when counting n in a row.
Edit: Here's a shorter version, which may also perform better for very long strings. And while we're using itertools, this one additionally utilises itertools.chain.from_iterable() to create the flattened list of letters. And since each of these is a generator, it's only evaluated/expanded at the last line:
import itertools
def max_seq(text, n=1):
    sequences = (list(g)[:n] for _, g in itertools.groupby(text))
    letters = itertools.chain.from_iterable(sequences)
    return ''.join(letters)
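If the runs can be very long, a possible variation is to take the first n items of each group with itertools.islice instead of building a list and slicing it. This is only a sketch under that assumption; limit_runs is a name chosen here for illustration.

```python
from itertools import chain, groupby, islice

def limit_runs(text, n=1):
    """Keep at most n characters from each run of identical characters."""
    # islice stops after n items, so a long run is never copied into a list;
    # groupby skips the rest of the run when it moves on to the next group.
    return "".join(chain.from_iterable(islice(g, n) for _, g in groupby(text)))

print(limit_runs("CCCDDDAABC", 2))  # CCDDAABC
print(limit_runs("XXYYZZ"))         # XYZ
```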
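hello = "hello frrriend"
def replacing() -> str:
    global hello
    j = 0
    for i in hello:
        # skip the comparison for the first character, which has no predecessor
        if j != 0:
            if i == prev:
                # note: str.replace removes every occurrence of the letter
                hello = hello.replace(i, "")
                prev = i
        prev = i
        j += 1
    return hello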
It looks a bit primitive, but I think it works. That's what I came up with on the go anyway; hope it helps :D
Here's my solution:
def snip_string(string, n):
    list_string = list(string)
    chars = set(string)
    for char in chars:
        # drop occurrences of char until at most n remain in the whole string
        while list_string.count(char) > n:
            list_string.remove(char)
    return ''.join(list_string)
Calling the function with various values for n gives the following output:
>>> string = "AAAABBBCCCDDD"
>>> snip_string(string, 1)
'ABCD'
>>> snip_string(string, 2)
'AABBCCDD'
>>> snip_string(string, 3)
'AAABBBCCCDDD'
Here is the updated version of my solution, which only removes characters if the group of repeated characters exceeds n.
import itertools
def snip_string(string, n):
    groups = [list(g) for k, g in itertools.groupby(string)]
    string_list = []
    for group in groups:
        # trim each run of identical characters down to n
        while len(group) > n:
            del group[-1]
        string_list.extend(group)
    return ''.join(string_list)
>>> snip_string(string, 3)
'AAABBBCCCDDD'
from itertools import groupby
n = 2
def rem(string):
    out = "".join(["".join(list(g)[:n]) for _, g in groupby(string)])
    return out
So this is the entire code for your question.
With following test:

count highest number of occurrence of consecutive substring in string python [duplicate]

What's the best way to count the number of occurrences of a given string, including overlap in Python? This is one way:
def function(string, str_to_search_for):
    count = 0
    for x in range(len(string) - len(str_to_search_for) + 1):
        if string[x:x+len(str_to_search_for)] == str_to_search_for:
            count += 1
    return count
This method returns 5.
Is there a better way in Python?
Well, this might be faster since it does the comparing in C:
def occurrences(string, sub):
    count = start = 0
    while True:
        # find() returns -1 when there is no further match, so start becomes 0
        start = string.find(sub, start) + 1
        if start > 0:
            count += 1
        else:
            return count
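When the positions of the matches are useful as well as the count, the same str.find() loop can be written as a generator. A minimal sketch; find_all is a hypothetical name, not something from the answers here.

```python
def find_all(text, sub):
    """Yield the start index of every occurrence of sub, overlaps included."""
    if not sub:
        return  # an empty substring would match everywhere; yield nothing
    start = text.find(sub)
    while start != -1:
        yield start
        # resume one character after the last match so overlaps are found
        start = text.find(sub, start + 1)

print(list(find_all("1011101111", "11")))          # [2, 3, 6, 7, 8]
print(sum(1 for _ in find_all("1011101111", "11")))  # 5
```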
>>> import re
>>> text = '1011101111'
>>> len(re.findall('(?=11)', text))
5
If you don't want to load the whole list of matches into memory (which would rarely be a problem), you could do this instead:
>>> sum(1 for _ in re.finditer('(?=11)', text))
5
As a function (re.escape makes sure the substring doesn't interfere with the regex):
def occurrences(text, sub):
    return len(re.findall('(?={0})'.format(re.escape(sub)), text))
>>> occurrences(text, '11')
5
You can also try the third-party regex module from PyPI, which supports overlapping matches.
import regex as re
def count_overlapping(text, search_for):
    return len(re.findall(search_for, text, overlapped=True))
count_overlapping('1011101111','11') # 5
Python's str.count counts non-overlapping substrings:
In [3]: "ababa".count("aba")
Out[3]: 1
Here are a few ways to count overlapping sequences, I'm sure there are many more :)
Look-ahead regular expressions
How to find overlapping matches with a regexp?
In [10]: re.findall("a(?=ba)", "ababa")
Out[10]: ['a', 'a']
Generate all substrings
In [11]: data = "ababa"
In [17]: sum(1 for i in range(len(data)) if data.startswith("aba", i))
Out[17]: 2
def count_substring(string, sub_string):
    count = 0
    for pos in range(len(string)):
        if string[pos:].startswith(sub_string):
            count += 1
    return count
This could be the easiest way.
A fairly pythonic way would be to use list comprehension here, although it probably wouldn't be the most efficient.
sequence = 'abaaadcaaaa'
substr = 'aa'
counts = sum([
    sequence.startswith(substr, i) for i in range(len(sequence))
])
print(counts) # 5
The list would be [False, False, True, True, False, False, False, True, True, True, False] as it checks every index in the string, and because int(True) == 1, sum gives us the total number of matches.
s = "bobobob"
sub = "bob"
ln = len(sub)
print(sum(sub == s[i:i+ln] for i in range(len(s)-(ln-1))))
How to find a pattern in another string with overlapping
This function (another solution!) receives a pattern and a text. It returns a list of all the matching substrings in the text together with their positions.
import re
def occurrences(pattern, text):
    """
    input: search a pattern (regular expression) in a text
    returns: a list of substrings and their positions
    """
    p = re.compile('(?=({0}))'.format(pattern))
    matches = re.finditer(p, text)
    return [(match.group(1), match.start()) for match in matches]
print (occurrences('ana', 'banana'))
print (occurrences('.ana', 'Banana-fana fo-fana'))
[('ana', 1), ('ana', 3)]
[('Bana', 0), ('nana', 2), ('fana', 7), ('fana', 15)]
My answer, to the bob question on the course:
s = 'azcbobobegghaklbob'
total = 0
for i in range(len(s)-2):
    if s[i:i+3] == 'bob':
        total += 1
print('number of times bob occurs is: ', total)
Here is my edX MIT "find bob"* solution (*find the number of "bob" occurrences in a string named s), which basically counts overlapping occurrences of a given substring:
s = 'azcbobobegghakl'
count = 0
while 'bob' in s:
    count += 1
    # skip 2 characters so the trailing 'b' can start the next match
    s = s[(s.find('bob') + 2):]
print("Number of times bob occurs is: {}".format(count))
If strings are large, you want to use Rabin-Karp, in summary:
a rolling window of substring size, moving over a string
a hash with O(1) overhead for adding and removing (i.e. move by 1 char)
implemented in C or relying on pypy
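To illustrate the three points above, here is a pure-Python sketch of a Rabin-Karp count. It is meant to show the rolling-hash mechanics only: on CPython it will likely be slower than the str.find() loops in the other answers, which is why the answer suggests C or pypy. The function name, base and modulus are assumptions picked for the example.

```python
def rabin_karp_count(text, pattern, base=256, mod=(1 << 61) - 1):
    """Count occurrences of pattern in text (overlaps included) with a rolling hash."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return 0
    high = pow(base, m - 1, mod)  # weight of the window's leading character
    p_hash = w_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        w_hash = (w_hash * base + ord(text[i])) % mod
    count = 0
    for i in range(n - m + 1):
        # Compare the actual characters only when the hashes agree,
        # which guards against hash collisions.
        if w_hash == p_hash and text[i:i + m] == pattern:
            count += 1
        if i + m < n:
            # O(1) update: drop text[i], shift, append text[i + m]
            w_hash = ((w_hash - ord(text[i]) * high) * base + ord(text[i + m])) % mod
    return count

print(rabin_karp_count("1011101111", "11"))  # 5
```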
That can be solved using regex.
import re
def function(string, sub_string):
    # escape sub_string so regex metacharacters are matched literally
    match = re.findall('(?='+re.escape(sub_string)+')',string)
    return len(match)
def count_substring(string, sub_string):
    counter = 0
    for i in range(len(string)):
        if string[i:].startswith(sub_string):
            counter = counter + 1
    return counter
The above code loops through the string once and, at each position, checks whether the remainder of the string starts with the substring being counted.
re.subn hasn't been mentioned yet:
>>> import re
>>> re.subn('(?=11)', '', '1011101111')[1]
5
def count_overlaps(string, look_for):
    start = 0
    matches = 0
    while True:
        start = string.find(look_for, start)
        if start < 0:
            break
        # step one character past this match so overlapping matches are found
        start += 1
        matches += 1
    return matches
print(count_overlaps('abrabra', 'abra'))
Function that takes two strings as input and counts how many times sub occurs in string, including overlaps. To check whether sub is a substring, I used the in operator.
def count_Occurrences(string, sub):
    count = 0
    for i in range(0, len(string)-len(sub)+1):
        if sub in string[i:i+len(sub)]:
            count += 1
    print('Number of times sub occurs in string (including overlaps): ', count)
For a duplicate question, I decided to compare the string three characters at a time, e.g.
counted = 0
for i in range(len(string)):
    # slide the three-character window by one so overlapping matches count
    if string[i:i+3] == 'xox':
        counted = counted + 1
print(counted)
An alternative very close to the accepted answer, but using the while condition as the test instead of an if inside the loop:
def countSubstr(string, sub):
    count = 0
    while sub in string:
        count += 1
        string = string[string.find(sub) + 1:]
    return count
This avoids while True: and is a little cleaner in my opinion.
This is another example of using str.find() but a lot of the answers make it more complicated than necessary:
def occurrences(text, sub):
    c, n = 0, text.find(sub)
    while n != -1:
        c += 1
        n = text.find(sub, n+1)
    return c
In []:
occurrences('1011101111', '11')
sequence = '1011101111'
sub = "11"
In this particular case:
sum(x == tuple(sub) for x in zip(sequence, sequence[1:]))
# 5
More generally, this
windows = zip(*([sequence[i:] for i, _ in enumerate(sequence)][:len(sub)]))
sum(x == tuple(sub) for x in windows)
# 5
or extend to generators:
import itertools as it
iter_ = (sequence[i:] for i, _ in enumerate(sequence))
windows = zip(*(it.islice(iter_, None, len(sub))))
sum(x == tuple(sub) for x in windows)
You can use more_itertools.locate:
import more_itertools as mit
len(list(mit.locate(sequence, pred=lambda *args: args == tuple(sub), window_size=len(sub))))
# 5
A simple way to count substring occurrences is to use count():
>>> s = 'bobob'
>>> s.count('bob')
1
You can use replace() to find overlapping strings if you know which part will overlap:
>>> s = 'bobob'
>>> s.replace('b', 'bb').count('bob')
2
Note that besides being static, there are other limitations:
>>> s = 'aaa'
>>> s.count('aa') # there must be two occurrences
1
>>> s.replace('a', 'aa').count('aa')
3
def occurance_of_pattern(text, pattern):
    text_len, pattern_len = len(text), len(pattern)
    return sum(1 for idx in range(text_len - pattern_len + 1) if text[idx: idx+pattern_len] == pattern)
I wanted to check whether the run of the first character at the start of a string is as long as the run of that same character at the end, e.g., succeed on "foo" and """foo"" but fail on """bar"":
from collections import deque
from functools import partial
from itertools import count, takewhile
from operator import eq
# From https://stackoverflow.com/a/15112059
def count_iter_items(iterable):
    """
    Consume an iterable not reading it into memory; return the number of items.
    :param iterable: An iterable
    :type iterable: ```Iterable```
    :return: Number of items in iterable
    :rtype: ```int```
    """
    counter = count()
    deque(zip(iterable, counter), maxlen=0)
    return next(counter)
def begin_matches_end(s):
    """
    Checks if the begin matches the end of the string
    :param s: Input string of length > 0
    :type s: ```str```
    :return: Whether the beginning matches the end (checks first match chars)
    :rtype: ```bool```
    """
    return (count_iter_items(takewhile(partial(eq, s[0]), s)) ==
            count_iter_items(takewhile(partial(eq, s[0]), s[::-1])))
Solution with replaced parts of the string
s = 'lolololol'
t = 0
t += s.count('lol')
s = s.replace('lol', 'lo1')
t += s.count('1ol')
print("Number of times lol occurs is:", t)
Answer is 4.
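If you want the counts of every substring of length 5 (adjust for different lengths as needed):
from collections import defaultdict
def MerCount(s):
    # defaultdict(int) starts every unseen substring at 0
    d = defaultdict(int)
    for i in range(len(s)-4):
        d[s[i:i+5]] += 1
    return d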

