No. of times a sub-string occurs in the string - python

Below is my code:
def count_substring(string, sub_string):
    counter = 0
    for x in range(0,len(string)):
        if string[x]+string[x+1]+string[x+2] == sub_string:
            counter +=1
    return counter
When I run the code it throws an error - "IndexError: string index out of range"
Please help me in understanding what is wrong with my code and also with the solution.
I am a beginner in Python. Please explain this to me like I am 5.

Can't you simply use str.count for non-overlapping matches?
str.count(substring, [start_index], [end_index])
full_str = 'Test for substring, check for word check'
sub_str = 'check'
print(full_str.count(sub_str))
Returns 2
If you have overlapping matches of your substring you could try re.findall with a positive lookahead:
import re
full_str = 'bobob'
sub_str = 'bob'
print(len(re.findall('(?='+sub_str+')',full_str)))
If you have the third-party regex module installed, you can instead use the overlapped parameter of its findall function and set it to True:
import regex as re
full_str = 'bobob'
sub_str = 'bob'
print(len(re.findall(sub_str, full_str, overlapped=True)))
Both options will return: 2
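One caveat with the lookahead approach: the substring is concatenated straight into the pattern, so characters such as . or + would be read as regex syntax. A minimal sketch, assuming the substring should always be matched literally (count_overlapping is a hypothetical helper name, not part of any library):

```python
import re

def count_overlapping(full_str, sub_str):
    """Count overlapping literal occurrences of sub_str in full_str."""
    if not sub_str:
        return 0  # assumption: an empty substring counts as zero matches
    # re.escape makes characters like '.' or '+' match literally
    pattern = '(?=' + re.escape(sub_str) + ')'
    return sum(1 for _ in re.finditer(pattern, full_str))

print(count_overlapping('bobob', 'bob'))
print(count_overlapping('aXa.a', 'a.a'))  # the '.' only matches a real dot
```

Without re.escape, the second call would also count 'aXa', because an unescaped dot matches any character.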

Couldn't you just use count? It takes far less code. See JvdV's answer. By the way, this is how I would do it:
def count_substring(string, substring):
    print(string.count(substring))
This simplifies the code a lot, and you could also get rid of the function entirely and do this:
print(string.count(substring)) # string and substring have to be defined first
If you want to include overlapping strings, then do this:
def count(string, substring):
    string_size = len(string)
    substring_size = len(substring)
    count = 0
    # range replaces the Python 2 xrange
    for i in range(0, string_size - substring_size + 1):
        if string[i:i + substring_size] == substring:
            count += 1
    return count
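To see why the loop in the question raises IndexError: x runs up to len(string) - 1, but the body reads string[x+2], which is past the end of the string for the last two values of x. It also assumes the substring is exactly three characters long. Below is a sketch of the question's loop with the bound fixed and a slice in place of the three index lookups (the name count_substring_fixed is only for illustration):

```python
def count_substring_fixed(string, sub_string):
    counter = 0
    n = len(sub_string)
    # The last valid start index is len(string) - n, so the slice
    # string[x:x + n] never runs past the end of the string.
    for x in range(len(string) - n + 1):
        if string[x:x + n] == sub_string:
            counter += 1
    return counter

print(count_substring_fixed('ABCDCDC', 'CDC'))
```

Slicing has the added benefit that an out-of-range slice returns a shorter string instead of raising, so it is more forgiving than indexing.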

Strings have a built-in count method for this purpose.
string = 'This is the way to do it.'
string.count('is')
Output: 2

Related

Find all permutation of a string in another string

I have a string, for example: string1 = 'abcdbcabdcabb'.
And I have another string, for example: string2 = 'cab'
I need to count all permutations of string2 in string1.
Currently I'm adding all permutations of string2 to a list,
then iterating through string1 by index + the size of string2 and checking
whether each substring of string1 is in the list of permutations.
I'm sure there is a better, more optimized way to do it.
In my opinion you do not need DP, just a sliding-window technique. A permutation of string2 is a string that has exactly the same length and the same distribution of characters. In your example, a permutation of string2 is a string of length 3 with this distribution of characters: {a:1,b:1,c:1}.
So you can write a script that considers a window of size N (the size of string2), starting from the beginning of string1 (index=0). If the current window has exactly the same distribution of characters, you count it as a permutation; if not, you do not count it. Either way, you move on to index+1.
A trick to avoid recalculating the character distribution for each window: keep a dictionary of character counts, fill it for the very first window, and then, each time you slide the window one to the right, decrease the count of the character that leaves by one and increase the count of the character that enters by one.
The code should be something like this; you need to verify it for edge cases:
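def get_permut(string1, string2):
    N = len(string2)
    M = len(string1)
    if M < N:
        return 0
    # character distribution a window must have to be a permutation
    valid_dist = dict()
    for ch in string2:
        valid_dist.setdefault(ch, 0)
        valid_dist[ch] += 1
    # character distribution of the first window
    current_dist = dict()
    for ch in string1[:N]:
        current_dist.setdefault(ch, 0)
        current_dist[ch] += 1
    ct = 0
    for i in range(M - N):
        if current_dist == valid_dist:
            ct += 1
        # slide the window: string1[i] leaves, string1[i + N] enters
        out_ch, in_ch = string1[i], string1[i + N]
        current_dist[out_ch] -= 1
        if current_dist[out_ch] == 0:
            del current_dist[out_ch]
        current_dist.setdefault(in_ch, 0)
        current_dist[in_ch] += 1
    # the loop stops before checking the last window
    if current_dist == valid_dist:
        ct += 1
    return ct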
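The same sliding-window idea can be written more compactly with collections.Counter. This is only a sketch of the technique described above, not the answerer's code; count_permutations is a hypothetical name, and treating an empty string2 as zero matches is an assumption:

```python
from collections import Counter

def count_permutations(string1, string2):
    """Count windows of string1 that are permutations of string2."""
    n, m = len(string2), len(string1)
    if n == 0 or m < n:
        return 0  # assumption: nothing to count in these cases
    target = Counter(string2)
    window = Counter(string1[:n])
    total = int(window == target)
    for i in range(n, m):
        window[string1[i]] += 1      # character entering the window
        left = string1[i - n]        # character leaving the window
        window[left] -= 1
        if window[left] == 0:
            del window[left]         # drop zero counts so equality stays exact
        total += window == target
    return total

print(count_permutations('abcdbcabdcabb', 'cab'))
```

Each step does a constant amount of work per slide plus one comparison of two small dictionaries, so no list of permutations is ever built.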
You can use the str.count() method here. Below is one way to do it (note that count only finds non-overlapping occurrences of each permutation):
import itertools
perms=[''.join(i) for i in itertools.permutations(string2)]
res=0
for i in perms:
    res+= string1.count(i)
print(res)
# 4
You can use regex for that.
def lst_permutation(text):
    from itertools import permutations
    lst_p=[]
    for i in permutations(text):
        lst_p.append(''.join(i))
    return lst_p
def total_permutation(string1,string2):
    import re
    total=0
    for i in lst_permutation(string2):
        # search for the current permutation, not for string2 itself
        res=re.findall(re.escape(i),string1)
        total += len(res)
    return total
string1 = 'abcdbcabdcabb'
string2 = 'cab'
print(total_permutation(string1,string2))
#4
Here's a dumb way to do it with a regex (don't actually do this).
Put an empty capturing group after each letter of the search text, wrap the alternatives in a repeated non-capturing group, and then use backreferences to require that every one of those groups took part in the match:
import re
string1 = 'abcdbcabdcabb'
string2 = r'(?:c()|a()|b()){3}\1\2\3'
pos = 0
r = re.compile(string2)
while m := r.search(string1, pos=pos):
    print(m.group())
    pos = m.start() + 1
abc
bca
cab
cab
The pattern can also be generated dynamically:
import re
string1 = 'abcdbcabdcabb'
string2 = 'cab'
before = "|".join([f"{l}()" for l in string2])
matches = "".join([f"\\{i + 1}" for i in range(len(string2))])
r = re.compile(f"(?:{before}){{{len(string2)}}}{matches}")
pos = 0
while m := r.search(string1, pos=pos):
    print(m.group())
    pos = m.start() + 1

Remove punctuation items from end of string

I have a seemingly simple problem which I cannot seem to solve. Given a string containing a DOI, I need to keep removing the last character while it is a punctuation mark, until the last character is a letter or number.
For example, if the string was:
sampleDoi = "10.1097/JHM-D-18-00044.',"
I want the following output:
"10.1097/JHM-D-18-00044"
i.e. remove .',
I wrote the following script to do this:
invalidChars = set(string.punctuation.replace("_", ""))
a = "10.1097/JHM-D-18-00044.',"
i = -1
for each in reversed(a):
    if any(char in invalidChars for char in each):
        a = a[:i]
        i = i - 1
    else:
        print (a)
        break
However, this produces 10.1097/JHM-D-18-00 but I would like it to produce 10.1097/JHM-D-18-00044. Why is the 44 removed from the end?
The string function rstrip() is designed to do exactly this:
>>> sampleDoi = "10.1097/JHM-D-18-00044.',"
>>> sampleDoi.rstrip(",.'")
'10.1097/JHM-D-18-00044'
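If the trailing characters are not known in advance, rstrip can be given a whole set of punctuation. A small sketch, assuming (as in the question) that every ASCII punctuation character except the underscore should be stripped; strip_trailing_punct is a hypothetical helper name:

```python
import string

# Same character set as in the question: all ASCII punctuation except "_"
INVALID_CHARS = string.punctuation.replace("_", "")

def strip_trailing_punct(doi):
    # rstrip treats its argument as a set of characters, not as a suffix,
    # and stops at the first character that is not in the set
    return doi.rstrip(INVALID_CHARS)

print(strip_trailing_punct("10.1097/JHM-D-18-00044.',"))
```

Punctuation inside the DOI (the dots, slash and hyphens) is untouched because stripping stops at the final 4.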
Corrected code:
import string
invalidChars = set(string.punctuation.replace("_", ""))
a = "10.1097/JHM-D-18-00044.',"
i = -1
for each in reversed(a):
    if any(char in invalidChars for char in each):
        a = a[:i]
        # i stays at -1; the "i = i - 1" line is removed altogether
    else:
        print (a)
        break
This gives the output you want while keeping the original code mostly the same. The original removed too much because i kept decreasing while a was already getting shorter, so successive slices cut off 1, 2 and then 3 characters.
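This is one way using next and str.isalnum with a generator expression utilizing enumerate / reversed.
sampleDoi = "10.1097/JHM-D-18-00044.',"
idx = next((i for i, j in enumerate(reversed(sampleDoi)) if j.isalnum()), len(sampleDoi))
res = sampleDoi[:len(sampleDoi) - idx]
print(res)
10.1097/JHM-D-18-00044
The default len(sampleDoi) is used so that, if no alphanumeric character is found, an empty string is returned. Slicing with len(sampleDoi) - idx rather than -idx also keeps the string intact when it has no trailing punctuation (idx is 0, and sampleDoi[:-0] would be empty).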
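If you don't want to use regex:
import string
the_str = "10.1097/JHM-D-18-00044.',"
# the "the_str and" check avoids an IndexError if everything gets removed
while the_str and the_str[-1] in string.punctuation:
    the_str = the_str[:-1]
This removes the last character until it is no longer a punctuation character.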

Count spaces in text (treat consecutive spaces as one)

How would you count the number of spaces or newline characters in a text in such a way that consecutive spaces are counted only as one?
For example, this is very close to what I want:
string = "This is an example text.\n But would be good if it worked."
counter = 0
for i in string:
    if i == ' ' or i == '\n':
        counter += 1
print(counter)
However, instead of returning with 15, the result should be only 11.
The default str.split() function will treat consecutive runs of spaces as one. So simply split the string, get the size of the resulting list, and subtract one.
len(string.split())-1
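One thing to keep in mind with this approach: split() with no argument also drops leading and trailing whitespace, so the result is really the number of gaps between words. A minimal sketch (count_gaps is a hypothetical name; clamping the result at zero for empty input is an assumption about the desired behaviour):

```python
def count_gaps(text):
    # With no argument, split() breaks on runs of any whitespace
    words = text.split()
    # max() avoids returning -1 for empty or all-whitespace text
    return max(len(words) - 1, 0)

text = "This is an example text.\n But would be good if it worked."
print(count_gaps(text))
```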
Assuming you are permitted to use Python regex:
import re
print(len(re.findall(r"[ \n]+", string)))
Quick and easy!
UPDATE: Additionally, use \s instead of [ \n] to match any whitespace character.
You can do this:
string = "This is an example text.\n But would be good if it worked."
counter = 0
# A boolean flag indicating whether the previous character was a space
previous = False
for i in string:
    if i == ' ' or i == '\n':
        # The current character is a space
        previous = True # Setup for the next iteration
    else:
        # The current character is not a space, check if the previous one was
        if previous:
            counter += 1
        previous = False
print(counter)
re to the rescue.
>>> import re
>>> string = "This is an example text.\n But would be good if it worked."
>>> spaces = sum(1 for match in re.finditer(r'\s+', string))
>>> spaces
11
This consumes minimal memory. An alternative solution that builds a temporary list would be
>>> len(re.findall(r'\s+', string))
11
If you only want to consider space characters and newline characters (as opposed to tabs, for example), use the regex '(\n| )+' instead of r'\s+'.
Just store a character that was the last character found. Set it to i each time you loop. Then within your inner if, do not increase the counter if the last character found was also a whitespace character.
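The description above could be sketched as follows. The function name and the WHITESPACE set are assumptions made for illustration, and only spaces and newlines are treated as whitespace, as in the question:

```python
WHITESPACE = {' ', '\n'}  # assumption: only these two count, as in the question

def count_whitespace_runs(text):
    counter = 0
    last = None  # no previous character before the loop starts
    for ch in text:
        # Count a whitespace character only when it starts a new run
        if ch in WHITESPACE and last not in WHITESPACE:
            counter += 1
        last = ch  # remember the current character for the next iteration
    return counter

text = "This is an example text.\n But would be good if it worked."
print(count_whitespace_runs(text))
```

Using None as the initial value (rather than an empty string) matters here: an empty string would compare as contained in any string, so a set membership test with None is the safer choice.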
You can iterate over indexes instead.
counter = 0
for i in range(1, len(string)):
    if string[i] in ' \n' and string[i-1] not in ' \n':
        counter += 1
if string[0] in ' \n':
    counter += 1
print(counter)
Pay attention to the first character: the loop starts from the second character because string[i-1] would otherwise refer to the last character of the string, so the first character is checked separately.
You can use enumerate, checking the next char is not also whitespace so consecutive whitespace will only count as 1:
string = "This is an example text.\n But would be good if it worked."
print(sum(ch.isspace() and not string[i:i+1].isspace() for i, ch in enumerate(string, 1)))
You can also use iter with a generator function, keeping track of the last character and comparing:
def con(s):
    it = iter(s)
    prev = next(it)
    for ele in it:
        yield prev.isspace() and not ele.isspace()
        prev = ele
    # prev is now the last character (this also works for one-character strings)
    yield prev.isspace()
print(sum(con(string)))
An itertools version:
string = "This is an example text.\n But would be good if it worked. "
from itertools import tee, zip_longest  # izip_longest in Python 2
a, b = tee(string)
next(b)
print(sum(x.isspace() and not y.isspace() for x, y in zip_longest(a, b, fillvalue="")))
Try:
def word_count(my_string):
    word_count = 1
    for i in range(1, len(my_string)):
        if my_string[i] == " ":
            if not my_string[i - 1] == " ":
                word_count += 1
    return word_count
You can use the function groupby() to find groups of consecutive spaces:
from collections import Counter
from itertools import groupby
s = 'This is an example text.\n But would be good if it worked.'
c = Counter(k for k, _ in groupby(s, key=lambda x: ' ' if x == '\n' else x))
print(c[' '])
# 11

python: Finding a substring within a string

I'm trying to write a program which counts how many times a substring appears within a string.
word = "wejmfoiwstreetstreetskkjoih"
streets = "streets"
count = 0
if streets in word:
    count += 1
print(count)
As you can see, "streets" appears twice, but the last s of the first streets is also the beginning of the second. I can't think of a way to loop over this.
Thanks!
This can be done using a regex:
>>> import re
>>> text = 'streetstreets'
>>> len(re.findall('(?=streets)', text))
2
From the docs:
(?=...)
Matches if ... matches next, but doesn’t consume any of the
string. This is called a lookahead assertion. For example, Isaac
(?=Asimov) will match 'Isaac ' only if it’s followed by 'Asimov'.
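To make the effect of the lookahead concrete, the short sketch below lists where each match starts and compares it with a plain (consuming) pattern. The variable names are only for illustration:

```python
import re

text = 'wejmfoiwstreetstreetskkjoih'

# A lookahead match is zero-width, so the scan resumes one character later
# and can find a second match that overlaps the first.
starts = [m.start() for m in re.finditer('(?=streets)', text)]

# A plain pattern consumes the shared 's', so the second match is lost.
plain = re.findall('streets', text)

print(starts)
print(len(plain))
```

The two start positions are six characters apart, one less than the length of "streets", which is exactly the overlap on the shared s.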
Quick and dirty:
>>> word = "wejmfoiwstreetstreetskkjoih"
>>> streets = "streets"
>>> sum(word[start:].startswith(streets) for start in range(len(word)))
2
A generic (though not as elegant) way would be a loop like this:
def count_substrings(stack, needle):
    idx = 0
    count = 0
    while True:
        idx = stack.find(needle, idx) + 1 # next time look after this idx
        if idx <= 0:
            break
        count += 1
    return count
My measurement shows that it's ~8.5 times faster than the solution with startswith for every substring.
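A timing claim like this is easy to check on your own machine with timeit. The sketch below is one possible setup, not the answerer's benchmark: the input string, the repeat factor and the number of runs are all arbitrary assumptions, and the ratio you see will vary with input and Python version.

```python
import timeit

def count_find(stack, needle):
    # find-based loop, as in the answer above
    idx = count = 0
    while True:
        idx = stack.find(needle, idx) + 1
        if idx <= 0:
            return count
        count += 1

def count_startswith(stack, needle):
    # startswith check at every position
    return sum(stack[start:].startswith(needle) for start in range(len(stack)))

word = "wejmfoiwstreetstreetskkjoih" * 50  # assumption: arbitrary test input
for fn in (count_find, count_startswith):
    seconds = timeit.timeit(lambda: fn(word, "streets"), number=200)
    print(fn.__name__, round(seconds, 4))
```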

python count repeating characters in a string by using dictionary function

I have a string
string = 'AAA'
When using the string.count('A') the output is equal to 3
and if it is string.count('AA') the output is equal to 1
However, there are 2 'AA's in the string.
Is there any method to count repeated string like above by using dictionary function?
I'd like to hear your helpful suggestions.
Thank you all in advance.
The problem is that count returns the number of non-overlapping occurrences of substring sub in the string.
Try this, as shown in this post:
def occurrences(string, sub):
    count = start = 0
    while True:
        start = string.find(sub, start) + 1
        if start > 0:
            count+=1
        else:
            return count
Alternative for "not huge" strings
>>> s, sub = 'AAA', 'AA'
>>> sum(s[x:].startswith(sub) for x in range(len(s)))
2
I find this a little more readable.
Yes, you can use a dictionary, although this counts single characters rather than substrings:
def count(text):
    a = {}
    for ch in text:
        a[ch] = a.get(ch, 0) + 1
    return a
print(count('AA')) # prints {'A': 2}
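If the goal is a dictionary of overlapping substring counts rather than character counts, one option is to count every window of a fixed length. This is only a sketch of that idea; window_counts is a hypothetical name, and it relies on collections.Counter, which is a dict subclass:

```python
from collections import Counter

def window_counts(text, size):
    """Map every substring of the given length to its overlapping count."""
    return Counter(text[i:i + size] for i in range(len(text) - size + 1))

counts = window_counts('AAA', 2)
print(counts['AA'])
```

Because every start position is considered, the two overlapping 'AA' windows in 'AAA' are both counted.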
