What's the best way to count the number of occurrences of a given string, including overlap in Python? This is one way:
def function(string, str_to_search_for):
    count = 0
    # range works in both Python 2 and 3 (the original used xrange)
    for x in range(len(string) - len(str_to_search_for) + 1):
        if string[x:x+len(str_to_search_for)] == str_to_search_for:
            count += 1
    return count
function('1011101111','11')
This method returns 5.
Is there a better way in Python?
Well, this might be faster since it does the comparing in C:
def occurrences(string, sub):
    count = start = 0
    while True:
        start = string.find(sub, start) + 1
        if start > 0:
            count+=1
        else:
            return count
>>> import re
>>> text = '1011101111'
>>> len(re.findall('(?=11)', text))
5
If you don't want to load the whole list of matches into memory (which is rarely a problem in practice), you could do this instead:
>>> sum(1 for _ in re.finditer('(?=11)', text))
5
As a function (re.escape makes sure the substring doesn't interfere with the regex):
def occurrences(text, sub):
    return len(re.findall('(?={0})'.format(re.escape(sub)), text))
>>> occurrences(text, '11')
5
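To illustrate why re.escape matters here, the sketch below compares the escaped function with an unescaped variant. The names count_overlapping and count_overlapping_unescaped are made up for this illustration and are not part of the answer above.

```python
import re

def count_overlapping(text, sub):
    # Zero-width lookahead: each match consumes nothing, so matches may overlap.
    return len(re.findall('(?={0})'.format(re.escape(sub)), text))

def count_overlapping_unescaped(text, sub):
    # Same idea without re.escape: `sub` is then interpreted as a regex.
    return len(re.findall('(?={0})'.format(sub), text))

print(count_overlapping('a.b.c', '.'))            # 2 (the two literal dots)
print(count_overlapping_unescaped('a.b.c', '.'))  # 5 ('.' matches any character)
```

Without escaping, a substring containing regex metacharacters silently counts something other than the literal text.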
You can also try the third-party regex module (pip install regex), which supports overlapping matches.
import regex as re
def count_overlapping(text, search_for):
    # note: search_for is treated as a regex pattern, not a literal string
    return len(re.findall(search_for, text, overlapped=True))
count_overlapping('1011101111','11') # 5
Python's str.count counts non-overlapping substrings:
In [3]: "ababa".count("aba")
Out[3]: 1
Here are a few ways to count overlapping sequences, I'm sure there are many more :)
Look-ahead regular expressions
How to find overlapping matches with a regexp?
In [10]: re.findall("a(?=ba)", "ababa")
Out[10]: ['a', 'a']
Generate all substrings
In [11]: data = "ababa"
In [17]: sum(1 for i in range(len(data)) if data.startswith("aba", i))
Out[17]: 2
def count_substring(string, sub_string):
    count = 0
    for pos in range(len(string)):
        if string[pos:].startswith(sub_string):
            count += 1
    return count
This could be the easiest way.
A fairly pythonic way would be to use list comprehension here, although it probably wouldn't be the most efficient.
sequence = 'abaaadcaaaa'
substr = 'aa'
counts = sum([
sequence.startswith(substr, i) for i in range(len(sequence))
])
print(counts) # 5
The list would be [False, False, True, True, False, False, False, True, True, True, False] as it checks every index of the string, and because int(True) == 1, sum gives us the total number of matches.
s = "bobobob"
sub = "bob"
ln = len(sub)
print(sum(sub == s[i:i+ln] for i in range(len(s)-(ln-1))))
How to find a pattern in another string with overlapping
This function (another solution!) receives a pattern and a text, and returns a list of all matching substrings in the text together with their positions.
import re
def occurrences(pattern, text):
    """
    input: search a pattern (regular expression) in a text
    returns: a list of substrings and their positions
    """
    p = re.compile('(?=({0}))'.format(pattern))
    matches = re.finditer(p, text)
    return [(match.group(1), match.start()) for match in matches]
print (occurrences('ana', 'banana'))
print (occurrences('.ana', 'Banana-fana fo-fana'))
[('ana', 1), ('ana', 3)]
[('Bana', 0), ('nana', 2), ('fana', 7), ('fana', 15)]
My answer, to the bob question on the course:
s = 'azcbobobegghaklbob'
total = 0
for i in range(len(s)-2):
    if s[i:i+3] == 'bob':
        total += 1
print('number of times bob occurs is: ', total)
Here is my edX MIT "find bob"* solution (*find the number of "bob" occurrences in a string named s), which basically counts overlapping occurrences of a given substring:
s = 'azcbobobegghakl'
count = 0
while 'bob' in s:
    count += 1
    # skip 2 characters so the trailing 'b' can start the next match
    s = s[(s.find('bob') + 2):]
print("Number of times bob occurs is: {}".format(count))
If strings are large, you want to use Rabin-Karp, in summary:
a rolling window of substring size, moving over a string
a hash with O(1) overhead for adding and removing (i.e. move by 1 char)
implemented in C or relying on pypy
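As a rough illustration of the rolling-hash idea in the list above, here is a pure-Python sketch. The function name rabin_karp_count and the base and mod values are assumptions chosen for the example. It shows the mechanics only and is not expected to beat str.find in CPython.

```python
def rabin_karp_count(text, sub, base=256, mod=(1 << 61) - 1):
    """Count overlapping occurrences of sub in text using a rolling hash."""
    n, m = len(text), len(sub)
    if m == 0 or m > n:
        return 0
    high = pow(base, m - 1, mod)  # weight of the character leaving the window
    sub_hash = win_hash = 0
    for i in range(m):
        sub_hash = (sub_hash * base + ord(sub[i])) % mod
        win_hash = (win_hash * base + ord(text[i])) % mod
    count = 0
    for i in range(n - m + 1):
        # On a hash hit, compare the actual slice to rule out collisions.
        if win_hash == sub_hash and text[i:i + m] == sub:
            count += 1
        if i + m < n:
            # Slide the window by one: drop text[i], append text[i + m].
            win_hash = ((win_hash - ord(text[i]) * high) * base
                        + ord(text[i + m])) % mod
    return count

print(rabin_karp_count('1011101111', '11'))  # 5
```

Each window update costs O(1), so the scan is linear in the text length apart from the slice comparisons on hash hits.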
That can be solved using regex.
import re
def function(string, sub_string):
    # re.escape keeps regex metacharacters in sub_string from being interpreted
    match = re.findall('(?='+re.escape(sub_string)+')',string)
    return len(match)
def count_substring(string, sub_string):
    counter = 0
    for i in range(len(string)):
        if string[i:].startswith(sub_string):
            counter = counter + 1
    return counter
The code above loops over the string once and, at each position, checks whether the remainder of the string starts with the substring being counted.
re.subn hasn't been mentioned yet:
>>> import re
>>> re.subn('(?=11)', '', '1011101111')[1]
5
def count_overlaps (string, look_for):
    start = 0
    matches = 0
    while True:
        start = string.find (look_for, start)
        if start < 0:
            break
        start += 1
        matches += 1
    return matches
print(count_overlaps ('abrabra', 'abra'))
Function that takes as input two strings and counts how many times sub occurs in string, including overlaps. To check whether sub is a substring, I used the in operator.
def count_Occurrences(string, sub):
    count=0
    for i in range(0, len(string)-len(sub)+1):
        # the slice has the same length as sub, so `in` acts as an equality test
        if sub in string[i:i+len(sub)]:
            count=count+1
    print('Number of times sub occurs in string (including overlaps): ', count)
For a duplicate question, I decided to compare three characters at a time against the target string, e.g.:
counted = 0
for i in range(len(string) - 2):
    # slide one character at a time so that overlapping matches are counted
    if string[i:i+3] == 'xox':
        counted = counted + 1
print(counted)
An alternative very close to the accepted answer but using while as the if test instead of including if inside the loop:
def countSubstr(string, sub):
    count = 0
    while sub in string:
        count += 1
        string = string[string.find(sub) + 1:]
    return count
This avoids while True: and is a little cleaner in my opinion.
This is another example of using str.find() but a lot of the answers make it more complicated than necessary:
def occurrences(text, sub):
    c, n = 0, text.find(sub)
    while n != -1:
        c += 1
        n = text.find(sub, n+1)
    return c
In []:
occurrences('1011101111', '11')
Out[]:
5
Given
sequence = '1011101111'
sub = "11"
Code
In this particular case:
sum(x == tuple(sub) for x in zip(sequence, sequence[1:]))
# 5
More generally, this
windows = zip(*([sequence[i:] for i, _ in enumerate(sequence)][:len(sub)]))
sum(x == tuple(sub) for x in windows)
# 5
or extend to generators:
import itertools as it
iter_ = (sequence[i:] for i, _ in enumerate(sequence))
windows = zip(*(it.islice(iter_, None, len(sub))))
sum(x == tuple(sub) for x in windows)
Alternative
You can use more_itertools.locate:
import more_itertools as mit
len(list(mit.locate(sequence, pred=lambda *args: args == tuple(sub), window_size=len(sub))))
# 5
A simple way to count substring occurrence is to use count():
>>> s = 'bobob'
>>> s.count('bob')
1
You can use replace() to find overlapping strings if you know which part will overlap:
>>> s = 'bobob'
>>> s.replace('b', 'bb').count('bob')
2
Note that besides being static, there are other limitations:
>>> s = 'aaa'
>>> s.count('aa') # there must be two occurrences
1
>>> s.replace('a', 'aa').count('aa')
3
def occurance_of_pattern(text, pattern):
    text_len , pattern_len = len(text), len(pattern)
    return sum(1 for idx in range(text_len - pattern_len + 1) if text[idx: idx+pattern_len] == pattern)
I wanted to check whether the run of the first character at the start of a string has the same length as its run at the end, e.g. it passes for "foo" and ""foo"" but fails on ""bar" (quote characters included):
from collections import deque
from functools import partial
from itertools import count, takewhile
from operator import eq
# From https://stackoverflow.com/a/15112059
def count_iter_items(iterable):
    """
    Consume an iterable not reading it into memory; return the number of items.
    :param iterable: An iterable
    :type iterable: ``Iterable``
    :return: Number of items in iterable
    :rtype: ``int``
    """
    counter = count()
    deque(zip(iterable, counter), maxlen=0)
    return next(counter)
def begin_matches_end(s):
    """
    Checks if the begin matches the end of the string
    :param s: Input string of length > 0
    :type s: ``str``
    :return: Whether the beginning matches the end (checks first match chars)
    :rtype: ``bool``
    """
    return (count_iter_items(takewhile(partial(eq, s[0]), s)) ==
            count_iter_items(takewhile(partial(eq, s[0]), s[::-1])))
Solution with replaced parts of the string
s = 'lolololol'
t = 0
t += s.count('lol')
s = s.replace('lol', 'lo1')
t += s.count('1ol')
print("Number of times lol occurs is:", t)
Answer is 4.
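If you want to count the occurrences of every substring of length 5 (adjust the window for other lengths):
from collections import defaultdict
def MerCount(s):
    d = defaultdict(int)  # maps each length-5 substring to its count
    for i in range(len(s)-4):
        d[s[i:i+5]] += 1
    return d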
I have a string that holds a very long sentence without whitespaces/spaces.
mystring = "abcdthisisatextwithsampletextforasampleabcd"
I would like to find all of the repeated substrings that contains minimum 4 chars.
So I would like to achieve something like this:
'text' 2 times
'sample' 2 times
'abcd' 2 times
As abcd, text and sample can each be found twice in mystring, they were recognized as properly matched substrings of at least 4 characters. It's important that I am seeking repeated substrings; finding only existing English words is not a requirement.
The answers I found are helpful for finding duplicates in texts with whitespaces, but I couldn't find a proper resource that covers the situation when there are no spaces and whitespaces in the string. How can this be done in the most efficient way?
Let's go through this step by step. There are several sub-tasks you should take care of:
Identify all substrings of length 4 or more.
Count the occurrence of these substrings.
Filter all substrings with 2 occurrences or more.
You can actually put all of them into a few statements. For understanding, it is easier to go through them one at a time.
The following examples all use
mystring = "abcdthisisatextwithsampletextforasampleabcd"
min_length = 4
1. Substrings of a given length
You can easily get substrings by slicing - for example, mystring[4:4+6] gives you the substring from position 4 of length 6: 'thisis'. More generically, you want substrings of the form mystring[start:start+length].
So what values do you need for start and length?
start must...
cover all substrings, so it must include the first character: start in range(0, ...).
not map to short substrings, so it can stop min_length characters before the end: start in range(..., len(mystring) - min_length + 1).
length must...
cover the shortest substring of length 4: length in range(min_length, ...).
not exceed the remaining string after start: length in range(..., len(mystring) - start + 1).
The +1 terms come from converting lengths (>=1) to indices (>=0).
You can put this all together into a single comprehension:
substrings = [
mystring[i:i+j]
for i in range(0, len(mystring) - min_length + 1)
for j in range(min_length, len(mystring) - i + 1)
]
2. Count substrings
Trivially, you want to keep a count for each substring. Keeping anything for each specific object is what dicts are made for. So you should use substrings as keys and counts as values in a dict. In essence, this corresponds to this:
counts = {}
for substring in substrings:
    try: # increase count for existing keys, set for new keys
        counts[substring] += 1
    except KeyError:
        counts[substring] = 1
You can simply feed your substrings to collections.Counter, and it produces something like the above.
>>> counts = collections.Counter(substrings)
>>> print(counts)
Counter({'abcd': 2, 'abcdt': 1, 'abcdth': 1, 'abcdthi': 1, 'abcdthis': 1, ...})
Notice how the duplicate 'abcd' maps to the count of 2.
3. Filtering duplicate substrings
So now you have your substrings and the count for each. You need to remove the non-duplicate substrings - those with a count of 1.
Python offers several constructs for filtering, depending on the output you want. These work also if counts is a regular dict:
>>> list(filter(lambda key: counts[key] > 1, counts))
['abcd', 'text', 'samp', 'sampl', 'sample', 'ampl', 'ample', 'mple']
>>> {key: value for key, value in counts.items() if value > 1}
{'abcd': 2, 'ampl': 2, 'ample': 2, 'mple': 2, 'samp': 2, 'sampl': 2, 'sample': 2, 'text': 2}
Using Python primitives
Python ships with primitives that allow you to do this more efficiently.
Use a generator to build substrings. A generator builds its member on the fly, so you never actually have them all in-memory. For your use case, you can use a generator expression:
substrings = (
mystring[i:i+j]
for i in range(0, len(mystring) - min_length + 1)
for j in range(min_length, len(mystring) - i + 1)
)
Use a pre-existing Counter implementation. Python comes with a dict-like container that counts its members: collections.Counter can directly digest your substring generator. Especially in newer versions of Python, this is much more efficient.
counts = collections.Counter(substrings)
You can exploit Python's lazy filters to only ever inspect one substring at a time. The filter builtin or another generator expression can produce one result at a time without storing them all in memory.
for substring in filter(lambda key: counts[key] > 1, counts):
    print(substring, 'occurs', counts[substring], 'times')
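Put together, the three steps might look like the following sketch. The function name repeated_substrings is made up for this example; it combines the generator, the Counter and the filter from above. Note that it counts overlapping occurrences.

```python
from collections import Counter

def repeated_substrings(text, min_length=4):
    # Step 1: lazily generate every substring of at least min_length characters.
    substrings = (
        text[i:i + j]
        for i in range(len(text) - min_length + 1)
        for j in range(min_length, len(text) - i + 1)
    )
    # Step 2: count them.
    counts = Counter(substrings)
    # Step 3: keep only those that occur more than once.
    return {sub: n for sub, n in counts.items() if n > 1}

result = repeated_substrings("abcdthisisatextwithsampletextforasampleabcd")
print(sorted(result))
# ['abcd', 'ampl', 'ample', 'mple', 'samp', 'sampl', 'sample', 'text']
```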
Nobody is using re! Time for an answer [ab]using the regular expression built-in module ;)
import re
Finding all the maximal substrings that are repeated
repeated_ones = set(re.findall(r"(.{4,})(?=.*\1)", mystring))
This matches the longest substrings which have at least a single repetition after (without consuming). So it finds all disjointed substrings that are repeated while only yielding the longest strings.
Finding all substrings that are repeated, including overlaps
mystring_overlap = "abcdeabcdzzzzbcde"
# In case we want to match both abcd and bcde
repeated_ones = set()
pos = 0
while True:
    match = re.search(r"(.{4,}).*(\1)+", mystring_overlap[pos:])
    if match:
        repeated_ones.add(match.group(1))
        # match.start() is relative to the slice; resume one character past it
        pos += match.start() + 1
    else:
        break
This ensures that all --not only disjoint-- substrings which have repetition are returned. It should be much slower, but gets the work done.
If you want in addition to the longest strings that are repeated, all the substrings, then:
base_repetitions = list(repeated_ones)
for s in base_repetitions:
    for i in range(4, len(s)):
        repeated_ones.add(s[:i])
That will ensure that for long substrings that have repetition, you have also the smaller substring --e.g. "sample" and "ample" found by the re.search code; but also "samp", "sampl", "ampl" added by the above snippet.
Counting matches
Because (by design) the substrings that we count are non-overlapping, the count method is the way to go:
from __future__ import print_function
for substr in repeated_ones:
    print("'%s': %d times" % (substr, mystring.count(substr)))
Results
Finding maximal substrings:
With the question's original mystring:
{'abcd', 'text', 'sample'}
with the mystring_overlap sample:
{'abcd'}
Finding all substrings:
With the question's original mystring:
{'abcd', 'ample', 'mple', 'sample', 'text'}
... and if we add the code to get all substrings then, of course, we get absolutely all the substrings:
{'abcd', 'ampl', 'ample', 'mple', 'samp', 'sampl', 'sample', 'text'}
with the mystring_overlap sample:
{'abcd', 'bcde'}
Future work
It's possible to filter the results of the finding all substrings with the following steps:
take a match "A"
check if this match is a substring of another match, call it "B"
if there is a "B" match, check the counter on that match "B_n"
if "A_n = B_n", then remove A
go to first step
It cannot happen that "A_n < B_n" because A is smaller than B (is a substring) so there must be at least the same number of repetitions.
If "A_n > B_n" it means that there is some extra match of the smaller substring, so it is a distinct substring because it is repeated in a place where B is not repeated.
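The filtering steps above could be sketched roughly as follows. This is only an illustration: the function name drop_redundant and the shape of its input (a dict mapping each repeated substring to its count) are assumptions, not part of the answer's code.

```python
def drop_redundant(counts):
    """counts: dict mapping each repeated substring to its number of occurrences."""
    kept = {}
    for a, a_n in counts.items():
        # A is redundant if some longer match B contains it with the same count.
        redundant = any(
            a != b and a in b and a_n == b_n
            for b, b_n in counts.items()
        )
        if not redundant:
            kept[a] = a_n
    return kept

counts = {'abcd': 2, 'ampl': 2, 'ample': 2, 'mple': 2,
          'samp': 2, 'sampl': 2, 'sample': 2, 'text': 2}
print(drop_redundant(counts))
# {'abcd': 2, 'sample': 2, 'text': 2}
```

A shorter match whose count is higher than that of every longer match containing it is kept, since it also repeats somewhere the longer one does not.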
Script (explanation where needed, in comments):
from collections import Counter
mystring = "abcdthisisatextwithsampletextforasampleabcd"
mystring_len = len(mystring)
possible_matches = []
matches = []
# `start_index` covers every position that leaves at least 4 characters,
# due to the minimum char count of 4
for start_index in range(0, mystring_len-3):
    # Start `end_index` at `start_index+1` and range it throughout the rest of
    # the string
    for end_index in range(start_index+1, mystring_len+1):
        current_string = mystring[start_index:end_index]
        if len(current_string) < 4: continue # Skip this iteration, if len < 4
        possible_matches.append(mystring[start_index:end_index])
for possible_match, count in Counter(possible_matches).most_common():
    # Iterate until count is less than or equal to 1 because `Counter`'s
    # `most_common` method lists them in order. Once 1 (or less) is hit, all
    # others are the same or lower.
    if count <= 1: break
    matches.append((possible_match, count))
for match, count in matches:
    print(f'\'{match}\' {count} times')
Output:
'abcd' 2 times
'text' 2 times
'samp' 2 times
'sampl' 2 times
'sample' 2 times
'ampl' 2 times
'ample' 2 times
'mple' 2 times
Here's a Python3 friendly solution:
from collections import Counter
min_str_length = 4
mystring = "abcdthisisatextwithsampletextforasampleabcd"
all_substrings =[mystring[start_index:][:end_index + 1] for start_index in range(len(mystring)) for end_index in range(len(mystring[start_index:]))]
counted_substrings = Counter(all_substrings)
not_counted_final_candidates = [item[0] for item in counted_substrings.most_common() if item[1] > 1 and len(item[0]) >= min_str_length]
counted_final_candidates = {item: counted_substrings[item] for item in not_counted_final_candidates}
print(counted_final_candidates)
Bonus: largest string
sub_sub_strings = [substring1 for substring1 in not_counted_final_candidates for substring2 in not_counted_final_candidates if substring1!=substring2 and substring1 in substring2 ]
largest_common_string = list(set(not_counted_final_candidates) - set(sub_sub_strings))
Everything as a function:
from collections import Counter
def get_repeated_strings(input_string, min_str_length = 2, calculate_largest_repeated_string = True ):
    all_substrings = [input_string[start_index:][:end_index + 1]
                      for start_index in range(len(input_string))
                      for end_index in range(len(input_string[start_index:]))]
    counted_substrings = Counter(all_substrings)
    not_counted_final_candidates = [item[0]
                                    for item in counted_substrings.most_common()
                                    if item[1] > 1 and len(item[0]) >= min_str_length]
    counted_final_candidates = {item: counted_substrings[item] for item in not_counted_final_candidates}
    ### This is just a bit of bonus code for calculating the largest repeating string
    if calculate_largest_repeated_string:
        sub_sub_strings = [substring1 for substring1 in not_counted_final_candidates for substring2 in
                           not_counted_final_candidates if substring1 != substring2 and substring1 in substring2]
        largest_common_strings = list(set(not_counted_final_candidates) - set(sub_sub_strings))
        return counted_final_candidates, largest_common_strings
    else:
        return counted_final_candidates
Example:
mystring = "abcdthisisatextwithsampletextforasampleabcd"
print(get_repeated_strings(mystring, min_str_length= 4))
Output:
({'abcd': 2, 'text': 2, 'samp': 2, 'sampl': 2, 'sample': 2, 'ampl': 2, 'ample': 2, 'mple': 2}, ['abcd', 'text', 'sample'])
CODE:
pattern = "abcdthisisatextwithsampletextforasampleabcd"
string_more_4 = []
k = 4
while(k <= len(pattern)):
    for i in range(len(pattern)):
        if pattern[i:k+i] not in string_more_4 and len(pattern[i:k+i]) >= 4:
            string_more_4.append( pattern[i:k+i])
    k+=1
for i in string_more_4:
    if pattern.count(i) >= 2:
        print(i + " -> " + str(pattern.count(i)) + " times")
OUTPUT:
abcd -> 2 times
text -> 2 times
samp -> 2 times
ampl -> 2 times
mple -> 2 times
sampl -> 2 times
ample -> 2 times
sample -> 2 times
Hope this helps as my code length was short and it is easy to understand. Cheers!
This is in Python 2 because I'm not doing Python 3 at this time. So you'll have to adapt it to Python 3 yourself.
#!python2
# import module
from collections import Counter
# get the indices
def getIndices(length):
    # holds the indices
    specific_range = []; all_sets = []
    # start building the indices
    for i in range(0, length - 2):
        # build a set of indices of a specific range
        for j in range(1, length + 2):
            specific_range.append([j - 1, j + i + 3])
            # append 'specific_range' to 'all_sets', reset 'specific_range'
            if specific_range[j - 1][1] == length:
                all_sets.append(specific_range)
                specific_range = []
                break
    # return all of the calculated indices ranges
    return all_sets
# store search strings
tmplst = []; combos = []; found = []
# string to be searched
mystring = "abcdthisisatextwithsampletextforasampleabcd"
# mystring = "abcdthisisatextwithtextsampletextforasampleabcdtext"
# get length of string
length = len(mystring)
# get all of the indices ranges, 4 and greater
all_sets = getIndices(length)
# get the search string combinations
for sublst in all_sets:
    for subsublst in sublst:
        tmplst.append(mystring[subsublst[0]: subsublst[1]])
    combos.append(tmplst)
    tmplst = []
# search for matching string patterns
for sublst in all_sets:
    for subsublst in sublst:
        for sublstitems in combos:
            if mystring[subsublst[0]: subsublst[1]] in sublstitems:
                found.append(mystring[subsublst[0]: subsublst[1]])
# make a dictionary containing the strings and their counts
d1 = Counter(found)
# filter out counts of 2 or more and print them
for k, v in d1.items():
    if v > 1:
        print k, v
$ cat test.py
import collections
import sys
S = "abcdthisisatextwithsampletextforasampleabcd"
def find(s, min_length=4):
    """
    Find repeated character sequences in a provided string.
    Arguments:
    s -- the string to be searched
    min_length -- the minimum length of the sequences to be found
    """
    counter = collections.defaultdict(int)
    # A repeated sequence can't be longer than half the length of s
    sequence_length = len(s) // 2
    # populate counter with all possible sequences
    while sequence_length >= min_length:
        # Iterate over the string until the number of remaining characters is
        # fewer than the length of the current sequence.
        for i, x in enumerate(s[:-(sequence_length - 1)]):
            # Window across the string, getting slices
            # of length == sequence_length.
            candidate = s[i:i + sequence_length]
            counter[candidate] += 1
        sequence_length -= 1
    # Report.
    for k, v in counter.items():
        if v > 1:
            print('{} {} times'.format(k, v))
    return
if __name__ == '__main__':
    try:
        s = sys.argv[1]
    except IndexError:
        s = S
    find(s)
$ python test.py
sample 2 times
sampl 2 times
ample 2 times
abcd 2 times
text 2 times
samp 2 times
ampl 2 times
mple 2 times
This is my approach to this problem:
def get_repeated_words(string, minimum_len):
    # Storing count of repeated words in this dictionary
    repeated_words = {}
    # Try every start index that leaves at least `minimum_len` characters
    # (in this case 4); the + 1 includes the last such start
    for i in range(len(string) - minimum_len + 1):
        # Starting with a length of 4 (`minimum_len`) and going till end of string, inclusive
        for j in range(i + minimum_len, len(string) + 1):
            # getting the current word
            word = string[i:j]
            # counting the (non-overlapping) occurrences of the word
            word_count = string.count(word)
            if word_count > 1:
                # storing in dictionary along with its count if found more than once
                repeated_words[word] = word_count
    return repeated_words
if __name__ == '__main__':
    mystring = "abcdthisisatextwithsampletextforasampleabcd"
    result = get_repeated_words(mystring, 4)
    print(result)
This is how I would do it, but I don't know any other way:
string = "abcdthisisatextwithsampletextforasampleabcd"
l = len(string)
occurences = {}
for i in range(4, l):
    # l - i + 1 so the window ending at the last character is included
    for start in range(l - i + 1):
        substring = string[start:start + i]
        occurences[substring] = occurences.get(substring, 0) + 1
for key in occurences.keys():
    if occurences[key] > 1:
        print("'" + key + "'", str(occurences[key]), "times")
Output (Python 3.7+, where dicts preserve insertion order):
'abcd' 2 times
'text' 2 times
'samp' 2 times
'ampl' 2 times
'mple' 2 times
'sampl' 2 times
'ample' 2 times
'sample' 2 times
Efficient, no, but easy to understand, yes.
Here is a simple solution using the more_itertools library.
Given
import collections as ct
import more_itertools as mit
s = "abcdthisisatextwithsampletextforasampleabcd"
lbound, ubound = len("abcd"), len(s)
Code
windows = mit.flatten(mit.windowed(s, n=i) for i in range(lbound, ubound))
filtered = {"".join(k): v for k, v in ct.Counter(windows).items() if v > 1}
filtered
Output
{'abcd': 2,
'text': 2,
'samp': 2,
'ampl': 2,
'mple': 2,
'sampl': 2,
'ample': 2,
'sample': 2}
Details
The steps are:
build sliding windows of varying sizes, lbound <= n < ubound
count all occurrences and keep only the substrings seen more than once
more_itertools is a third-party package; install it with pip install more_itertools.
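If installing a third-party package is not an option, the same two steps can be sketched with the standard library only. The function name repeated_substrings and its min_len parameter are my own choices for illustration, not part of more_itertools:

```python
from collections import Counter

def repeated_substrings(s, min_len=4):
    """Return {substring: count} for substrings of length >= min_len seen more than once."""
    counts = Counter(
        s[start:start + size]
        for size in range(min_len, len(s))       # step 1: every window size
        for start in range(len(s) - size + 1)    # ...at every start position
    )
    # step 2: keep only the substrings that occur more than once
    return {sub: n for sub, n in counts.items() if n > 1}

result = repeated_substrings("abcdthisisatextwithsampletextforasampleabcd")
print(result)
```

Like the windowed version, this counts overlapping occurrences and builds every window in memory, so it is only reasonable for short strings.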
s = 'abcabcabcdabcd'
def get_repeats(s, l):
    # Count every substring of length l, including the last window
    d = {}
    for i in range(len(s) - l + 1):
        ss = s[i: i+l]
        if ss not in d:
            d[ss] = 1
        else:
            d[ss] = d[ss]+1
    return d
get_repeats(s, 3)
What's the best way to count the number of occurrences of a given string, including overlap in Python? This is one way:
def function(string, str_to_search_for):
    count = 0
    for x in range(len(string) - len(str_to_search_for) + 1):
        if string[x:x+len(str_to_search_for)] == str_to_search_for:
            count += 1
    return count
function('1011101111','11')
This method returns 5.
Is there a better way in Python?
Well, this might be faster since it does the comparing in C:
def occurrences(string, sub):
    count = start = 0
    while True:
        start = string.find(sub, start) + 1
        if start > 0:
            count += 1
        else:
            return count
>>> import re
>>> text = '1011101111'
>>> len(re.findall('(?=11)', text))
5
If you don't want to build the whole list of matches in memory (which is rarely a problem in practice), you can count them lazily instead:
>>> sum(1 for _ in re.finditer('(?=11)', text))
5
As a function (re.escape makes sure the substring doesn't interfere with the regex):
def occurrences(text, sub):
    return len(re.findall('(?={0})'.format(re.escape(sub)), text))
>>> occurrences(text, '11')
5
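To see why re.escape matters here, a small comparison may help. The occurrences_unescaped helper is a hypothetical variant written only for this illustration:

```python
import re

def occurrences(text, sub):
    # Escape the substring so characters like '.' or '+' are matched literally
    return len(re.findall('(?={0})'.format(re.escape(sub)), text))

def occurrences_unescaped(text, sub):
    # Hypothetical variant without re.escape, for comparison only
    return len(re.findall('(?={0})'.format(sub), text))

print(occurrences('a.b.axb', 'a.b'))            # 1: only the literal 'a.b'
print(occurrences_unescaped('a.b.axb', 'a.b'))  # 2: '.' also matches the 'x' in 'axb'
```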
You can also try the third-party regex module (available on PyPI), which supports overlapping matches.
import regex as re
def count_overlapping(text, search_for):
    return len(re.findall(search_for, text, overlapped=True))
count_overlapping('1011101111','11') # 5
Python's str.count counts non-overlapping substrings:
In [3]: "ababa".count("aba")
Out[3]: 1
Here are a few ways to count overlapping sequences, I'm sure there are many more :)
Look-ahead regular expressions
How to find overlapping matches with a regexp?
In [10]: re.findall("a(?=ba)", "ababa")
Out[10]: ['a', 'a']
Generate all substrings
In [11]: data = "ababa"
In [17]: sum(1 for i in range(len(data)) if data.startswith("aba", i))
Out[17]: 2
def count_substring(string, sub_string):
    count = 0
    for pos in range(len(string)):
        if string[pos:].startswith(sub_string):
            count += 1
    return count
This could be the easiest way.
A fairly pythonic way would be to use list comprehension here, although it probably wouldn't be the most efficient.
sequence = 'abaaadcaaaa'
substr = 'aa'
counts = sum([
sequence.startswith(substr, i) for i in range(len(sequence))
])
print(counts) # 5
The list would be [False, False, True, True, False, False, False, True, True, True, False], since every index in the string is checked; because int(True) == 1, sum gives us the total number of matches.
s = "bobobob"
sub = "bob"
ln = len(sub)
print(sum(sub == s[i:i+ln] for i in range(len(s)-(ln-1))))
How to find a pattern in another string with overlapping
This function (another solution!) receives a pattern and a text, and returns a list of all matching substrings found in the text together with their positions.
import re
def occurrences(pattern, text):
    """
    input: search a pattern (regular expression) in a text
    returns: a list of substrings and their positions
    """
    p = re.compile('(?=({0}))'.format(pattern))
    matches = re.finditer(p, text)
    return [(match.group(1), match.start()) for match in matches]
print (occurrences('ana', 'banana'))
print (occurrences('.ana', 'Banana-fana fo-fana'))
[('ana', 1), ('ana', 3)]
[('Bana', 0), ('nana', 2), ('fana', 7), ('fana', 15)]
My answer, to the bob question on the course:
s = 'azcbobobegghaklbob'
total = 0
for i in range(len(s)-2):
    if s[i:i+3] == 'bob':
        total += 1
print('number of times bob occurs is: ', total)
Here is my edX MIT "find bob"* solution (*find the number of "bob" occurrences in a string named s), which basically counts overlapping occurrences of a given substring:
s = 'azcbobobegghakl'
count = 0
while 'bob' in s:
    count += 1
    # Skip 2 characters so the trailing 'b' can start the next match
    s = s[(s.find('bob') + 2):]
print("Number of times bob occurs is: {}".format(count))
If the strings are large, consider Rabin-Karp. In summary:
a rolling window of substring size, moving over a string
a hash with O(1) overhead for adding and removing (i.e. move by 1 char)
implemented in C or relying on pypy
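A minimal pure-Python sketch of that idea follows. The function name rabin_karp_count and the base and mod parameters are my own choices; in CPython this will usually be slower than a str.find loop, so treat it as an illustration of the rolling hash and not as a tuned implementation:

```python
def rabin_karp_count(text, pattern, base=256, mod=(1 << 61) - 1):
    """Count overlapping occurrences of pattern in text with a rolling hash."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return 0
    high = pow(base, m - 1, mod)  # weight of the character leaving the window
    p_hash = w_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        w_hash = (w_hash * base + ord(text[i])) % mod
    count = 0
    for i in range(n - m + 1):
        # On a hash match, compare the actual slice to rule out collisions
        if w_hash == p_hash and text[i:i + m] == pattern:
            count += 1
        if i + m < n:
            # Slide the window in O(1): drop text[i], append text[i + m]
            w_hash = ((w_hash - ord(text[i]) * high) * base + ord(text[i + m])) % mod
    return count

print(rabin_karp_count('1011101111', '11'))  # 5
```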
That can be solved using regex.
import re
def function(string, sub_string):
    # re.escape keeps special characters in sub_string from being read as regex syntax
    match = re.findall('(?=' + re.escape(sub_string) + ')', string)
    return len(match)
def count_substring(string, sub_string):
    counter = 0
    for i in range(len(string)):
        if string[i:].startswith(sub_string):
            counter = counter + 1
    return counter
The code above loops over the string once and, at each position, checks whether the remainder of the string starts with the substring being counted.
re.subn hasn't been mentioned yet:
>>> import re
>>> re.subn('(?=11)', '', '1011101111')[1]
5
def count_overlaps(string, look_for):
    start = 0
    matches = 0
    while True:
        start = string.find(look_for, start)
        if start < 0:
            break
        start += 1
        matches += 1
    return matches
print(count_overlaps('abrabra', 'abra'))
This function takes two strings as input and counts how many times sub occurs in string, including overlaps. To check whether sub matches at each position, I used the in operator on a slice of the same length as sub.
def count_Occurrences(string, sub):
    count = 0
    for i in range(0, len(string)-len(sub)+1):
        if sub in string[i:i+len(sub)]:
            count = count+1
    print('Number of times sub occurs in string (including overlaps): ', count)
For a duplicate question, I decided to go through the string three characters at a time and compare each window, e.g.:
counted = 0
for i in range(len(string) - 2):
    # Compare each 3-character window, moving one position at a time
    if string[i:i+3] == 'xox':
        counted = counted + 1
print(counted)
An alternative very close to the accepted answer, but using the while condition as the test instead of an if inside the loop:
def countSubstr(string, sub):
    count = 0
    while sub in string:
        count += 1
        string = string[string.find(sub) + 1:]
    return count
This avoids while True: and is a little cleaner, in my opinion.
This is another example of using str.find() but a lot of the answers make it more complicated than necessary:
def occurrences(text, sub):
    c, n = 0, text.find(sub)
    while n != -1:
        c += 1
        n = text.find(sub, n+1)
    return c
In []:
occurrences('1011101111', '11')
Out[]:
5
Given
sequence = '1011101111'
sub = "11"
Code
In this particular case:
sum(x == tuple(sub) for x in zip(sequence, sequence[1:]))
# 5
More generally, this
windows = zip(*([sequence[i:] for i, _ in enumerate(sequence)][:len(sub)]))
sum(x == tuple(sub) for x in windows)
# 5
or extend to generators:
import itertools as it
iter_ = (sequence[i:] for i, _ in enumerate(sequence))
windows = zip(*(it.islice(iter_, None, len(sub))))
sum(x == tuple(sub) for x in windows)
Alternative
You can use more_itertools.locate:
import more_itertools as mit
len(list(mit.locate(sequence, pred=lambda *args: args == tuple(sub), window_size=len(sub))))
# 5
A simple way to count substring occurrence is to use count():
>>> s = 'bobob'
>>> s.count('bob')
1
You can use replace() to find overlapping strings if you know which part will overlap:
>>> s = 'bobob'
>>> s.replace('b', 'bb').count('bob')
2
Note that besides being static, there are other limitations; for example, it can overcount:
>>> s = 'aaa'
>>> s.count('aa') # there should be two overlapping occurrences
1
>>> s.replace('a', 'aa').count('aa') # overcounts: 3 instead of 2
3
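One way to check whether the replace trick is safe for a given input is to compare it with a direct scan over every start position. Both helper names below are made up for this illustration:

```python
def count_overlapping(s, sub):
    # Reference count: test every start position
    return sum(s.startswith(sub, i) for i in range(len(s)))

def count_via_replace(s, sub, overlap_char):
    # The replace trick: double the character shared by neighbouring matches
    return s.replace(overlap_char, overlap_char * 2).count(sub)

print(count_overlapping('bobob', 'bob'), count_via_replace('bobob', 'bob', 'b'))  # 2 2
print(count_overlapping('aaa', 'aa'), count_via_replace('aaa', 'aa', 'a'))        # 2 3
```

The two agree for 'bobob' but not for 'aaa', where doubling every 'a' creates matches that were never in the original string.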
def occurance_of_pattern(text, pattern):
    text_len, pattern_len = len(text), len(pattern)
    return sum(1 for idx in range(text_len - pattern_len + 1) if text[idx: idx+pattern_len] == pattern)
I wanted to check whether the run of the first character at the start of the input is as long as the run of that same character at the end, e.g., "foo" and """foo"" but fail on """bar"":
from collections import deque
from functools import partial
from itertools import count, takewhile
from operator import eq
# From https://stackoverflow.com/a/15112059
def count_iter_items(iterable):
    """
    Consume an iterable not reading it into memory; return the number of items.
    :param iterable: An iterable
    :type iterable: ```Iterable```
    :return: Number of items in iterable
    :rtype: ```int```
    """
    counter = count()
    deque(zip(iterable, counter), maxlen=0)
    return next(counter)
def begin_matches_end(s):
    """
    Checks if the begin matches the end of the string
    :param s: Input string of length > 0
    :type s: ```str```
    :return: Whether the beginning matches the end (checks first match chars)
    :rtype: ```bool```
    """
    return (count_iter_items(takewhile(partial(eq, s[0]), s)) ==
            count_iter_items(takewhile(partial(eq, s[0]), s[::-1])))
Solution with replaced parts of the string
s = 'lolololol'
t = 0
t += s.count('lol')
s = s.replace('lol', 'lo1')
t += s.count('1ol')
print("Number of times lol occurs is:", t)
Answer is 4.
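If you want to count the substrings (k-mers) of length 5 (adjust if wanted for different lengths):
import collections
def MerCount(s):
    # d was not defined in the original; a defaultdict(int) is assumed here
    d = collections.defaultdict(int)
    for i in range(len(s)-4):
        d[s[i:i+5]] += 1
    return d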
How can I compare all strings in a list e.g:
"A-B-C-D-E-F-H-A",
"A-B-C-F-G-H-M-P",
And output until which character they are identical:
In the example above it would be:
Character 6
And output the most similar strings.
I tried with collections.Counter but that did not work.
You're trying to go character by character in the two strings in lockstep. This is a job for zip:
A = "A-B-C-D-E-F-H-A"
B = "A-B-C-F-G-H-M-P"
count = 0
for a, b in zip(A, B):
    if a == b:
        count += 1
    else:
        break
Or, if you prefer, "…as long as they are…" is a job for takewhile:
from itertools import takewhile
from operator import eq
def ilen(iterable): return sum(1 for _ in iterable)
count = ilen(takewhile(lambda ab: eq(*ab), zip(A, B)))
If you have a list of these strings, and you want to compare every string to every other string:
First, you turn the above code into a function. I'll do it with the itertools version, but you can do it with the other just as easily:
def shared_prefix(A, B):
    return ilen(takewhile(lambda ab: eq(*ab), zip(A, B)))
Now, for every string, you compare it to all the rest of the strings. There's an easy way to do it with combinations:
from itertools import combinations
counts = [shared_prefix(*pair) for pair in combinations(list_o_strings, 2)]
But if you don't understand that, you can write it as a nested loop. The only tricky part is what "the rest of the strings" means. You can't loop over all the strings in both the outer and inner loops, or you'll compare each pair of strings twice (once in each order), and compare each string to itself. So it has to mean "all the strings after the current one". Like this:
counts = []
for i, s1 in enumerate(list_o_strings):
    for s2 in list_o_strings[i+1:]:
        counts.append(shared_prefix(s1, s2))
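To also report the most similar strings, one option is to take the pair with the longest shared prefix. This is only a sketch: the third string in the sample list is made up for illustration, and ties are resolved in favour of the first pair that combinations yields:

```python
from itertools import combinations, takewhile

def shared_prefix(a, b):
    # Number of leading positions where both strings agree
    return sum(1 for _ in takewhile(lambda ab: ab[0] == ab[1], zip(a, b)))

# Sample data; the third string is hypothetical
list_o_strings = [
    "A-B-C-D-E-F-H-A",
    "A-B-C-F-G-H-M-P",
    "A-B-X-F-G-H-M-P",
]

# Pick the pair whose shared prefix is longest
best_pair = max(combinations(list_o_strings, 2), key=lambda pair: shared_prefix(*pair))
print(best_pair, shared_prefix(*best_pair))
```

For the two strings from the question this gives 6, matching the expected "Character 6".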
I think this code will solve your problem.
listA = "A-B-C-D-E-F-H-A"
listB = "A-B-C-F-G-H-M-P"
newListA = listA.replace("-", "")
newListB = listB.replace("-", "")
# newListA = "ABCDEFHA"
# newListB = "ABCFGHMP"
i = 0
mismatch = False
# Stop at the first mismatch or at the end of the shorter string
while i < min(len(newListA), len(newListB)) and not mismatch:
    if newListA[i] != newListB[i]:
        mismatch = True
    i = i + 1
print("Character: " + str(i))