python: Finding a substring within a string

I'm trying to write a program which counts how many times a substring appears within a string.
word = "wejmfoiwstreetstreetskkjoih"
streets = "streets"
count = 0
if streets in word:
    count += 1
print(count)
As you can see, "streets" appears twice, but the last "s" of the first "streets" is also the beginning of the second. I can't think of a way to loop over this.
Thanks!

This can be done using a regex:
>>> import re
>>> text = 'streetstreets'
>>> len(re.findall('(?=streets)', text))
2
From the docs:
(?=...)
Matches if ... matches next, but doesn’t consume any of the string. This is called a lookahead assertion. For example, Isaac (?=Asimov) will match 'Isaac ' only if it’s followed by 'Asimov'.
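The lookahead approach can be wrapped in a small helper. This is only a minimal sketch: the name count_overlapping is hypothetical, and re.escape is added on the assumption that the substring should be matched literally rather than interpreted as a pattern.

```python
import re

def count_overlapping(text, sub):
    """Count occurrences of sub in text, including overlapping ones."""
    if not sub:
        return 0  # assumption: an empty substring counts as zero matches
    # The lookahead matches at every position where sub starts, without consuming it
    return len(re.findall('(?=' + re.escape(sub) + ')', text))

print(count_overlapping("wejmfoiwstreetstreetskkjoih", "streets"))  # 2
```

Without re.escape, a substring such as "a.a" would treat the dot as a wildcard.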

Quick and dirty:
>>> word = "wejmfoiwstreetstreetskkjoih"
>>> streets = "streets"
>>> sum(word[start:].startswith(streets) for start in range(len(word)))
2

A generic (though not as elegant) way would be a loop like this:
def count_substrings(stack, needle):
    idx = 0
    count = 0
    while True:
        idx = stack.find(needle, idx) + 1  # next time look after this idx
        if idx <= 0:
            break
        count += 1
    return count
My measurement shows that it's ~8.5 times faster than the solution with startswith for every substring.
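A speed comparison like this can be reproduced with timeit. The sketch below is an assumption about how such a measurement might be set up, not the measurement the answer refers to; the function names count_startswith and count_find are hypothetical, and the numbers will differ between machines and inputs.

```python
import timeit

word = "wejmfoiwstreetstreetskkjoih"
needle = "streets"

def count_startswith(stack, needle):
    # Brute force: test every start position with a fresh slice
    return sum(stack[start:].startswith(needle) for start in range(len(stack)))

def count_find(stack, needle):
    count = 0
    idx = stack.find(needle)
    while idx != -1:
        count += 1
        idx = stack.find(needle, idx + 1)  # step by one so overlaps are found
    return count

# Both approaches should agree before their timings are compared
assert count_startswith(word, needle) == count_find(word, needle) == 2

for fn in (count_startswith, count_find):
    seconds = timeit.timeit(lambda: fn(word, needle), number=10_000)
    print(f"{fn.__name__}: {seconds:.4f} s")
```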

Related

Find all permutation of a string in another string

I have a string, for example: string1 = 'abcdbcabdcabb'.
And I have another string, for example: string2 = 'cab'
I need to count all permutations of string2 in string1.
Currently I'm adding all permutations of string2 to a list, then iterating through string1 by index + string size and checking whether each substring of string1 is in the list of permutations.
I'm sure there is a better-optimized way to do it.

In my view you do not need DP, just a sliding-window technique. A permutation of string2 is a string that has exactly the same length and the same distribution of characters. For your example string2, a permutation is a string of length 3 with this distribution of characters: {a:1,b:1,c:1}.
So you can write a script that considers a window of size N (the size of string2), starting from the beginning of string1 (index=0). If the current window has exactly the same distribution of characters, you count it as a permutation; if not, you do not count it. Either way, you move on to index+1.
A trick to avoid recalculating the character distribution for each window: keep a dictionary of character counts, fill it for the very first window, and then each time you slide the window one position to the right, decrease the count of the character that leaves by 1 and increase the count of the character that enters by 1.
The code should be something like this; you need to verify it for edge cases:
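def get_permut(string1, string2):
    N = len(string2)
    M = len(string1)
    if N == 0 or M < N:
        return 0
    valid_dist = dict()
    for ch in string2:
        valid_dist.setdefault(ch, 0)
        valid_dist[ch] += 1
    current_dist = dict()
    for ch in string1[:N]:
        current_dist.setdefault(ch, 0)
        current_dist[ch] += 1
    ct = 0
    for i in range(M - N):
        if current_dist == valid_dist:
            ct += 1
        # slide the window: string1[i] leaves, string1[i + N] enters
        out_ch, in_ch = string1[i], string1[i + N]
        current_dist[out_ch] -= 1
        if current_dist[out_ch] == 0:
            del current_dist[out_ch]
        current_dist.setdefault(in_ch, 0)
        current_dist[in_ch] += 1
    # the loop above never checks the final window, so check it here
    if current_dist == valid_dist:
        ct += 1
    return ct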
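The same sliding-window idea can also be written with collections.Counter. This is a sketch under the assumption that an empty pattern should count as zero matches; the name count_permutations is hypothetical.

```python
from collections import Counter

def count_permutations(text, pattern):
    """Count windows of text that are a rearrangement of pattern."""
    n = len(pattern)
    if n == 0 or len(text) < n:
        return 0
    target = Counter(pattern)
    window = Counter(text[:n])
    matches = int(window == target)
    for i in range(n, len(text)):
        window[text[i]] += 1      # character entering the window
        left = text[i - n]
        window[left] -= 1         # character leaving the window
        if window[left] == 0:
            del window[left]      # keep zero counts out of the comparison
        matches += window == target
    return matches

print(count_permutations('abcdbcabdcabb', 'cab'))  # 4
```

Each step updates two counts instead of rebuilding the distribution, so the work per window depends on the alphabet size rather than on the window length.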

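You can use the str.count() method here. One way to do it:
import itertools
string1 = 'abcdbcabdcabb'
string2 = 'cab'
# a set drops duplicate permutations when string2 has repeated letters
perms = {''.join(p) for p in itertools.permutations(string2)}
res = 0
for p in perms:
    res += string1.count(p)  # non-overlapping count for each permutation
print(res)
# 4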

You can use regex for that.
import re
from itertools import permutations
def lst_permutation(text):
    lst_p = []
    for i in permutations(text):
        lst_p.append(''.join(i))
    return lst_p
def total_permutation(string1, string2):
    total = 0
    for i in lst_permutation(string2):
        # search for the current permutation, not for string2 itself
        res = re.findall(re.escape(i), string1)
        total += len(res)
    return total
string1 = 'abcdbcabdcabb'
string2 = 'cab'
print(total_permutation(string1, string2))
# 4

Here's a dumb way to do it with a regex (don't actually do this).
Use an empty capturing group for each letter of the search text, wrapped in a non-capturing alternation, and then require a backreference to each of those groups, which only succeeds if every letter was matched:
import re
string1 = 'abcdbcabdcabb'
string2 = r'(?:c()|a()|b()){3}\1\2\3'
pos = 0
r = re.compile(string2)
while m := r.search(string1, pos=pos):
    print(m.group())
    pos = m.start() + 1
abc
bca
cab
cab
You can also generate the pattern dynamically:
import re
string1 = 'abcdbcabdcabb'
string2 = 'cab'
before = "|".join([f"{l}()" for l in string2])
matches = "".join([f"\\{i + 1}" for i in range(len(string2))])
r = re.compile(f"(?:{before}){{{len(string2)}}}{matches}")
pos = 0
while m := r.search(string1, pos=pos):
    print(m.group())
    pos = m.start() + 1

No. of times a sub-string occurs in the string

Below is my code:
def count_substring(string, sub_string):
    counter = 0
    for x in range(0,len(string)):
        if string[x]+string[x+1]+string[x+2] == sub_string:
            counter +=1
    return counter
When I run the code it throws an error - "IndexError: string index out of range"
Please help me in understanding what is wrong with my code and also with the solution.
I am a beginner in Python. Please explain this to me like I am 5.

Can't you simply use str.count for non-overlapping matches?
str.count(substring, [start_index], [end_index])
full_str = 'Test for substring, check for word check'
sub_str = 'check'
print(full_str.count(sub_str))
Returns 2
If you have overlapping matches of your substring you could try re.findall with a positive lookahead:
import re
full_str = 'bobob'
sub_str = 'bob'
print(len(re.findall('(?='+sub_str+')',full_str)))
If you have the third-party regex module installed, you can instead use its findall with the overlapped parameter set to True:
import regex as re
full_str = 'bobob'
sub_str = 'bob'
print(len(re.findall(sub_str, full_str, overlapped=True)))
Both options will return: 2

Couldn't you just use count? It takes far less code. See JvdV's answer. This is how I would do it:
def count_substring(string, substring):
    print(string.count(substring))
This simplifies the code a lot. You could also get rid of the function entirely and do this:
print(string.count(substring)) # by the way you have to define string and substring first
If you want to include overlapping matches, then do this:
def count(string, substring):
    string_size = len(string)
    substring_size = len(substring)
    count = 0
    for i in range(0, string_size - substring_size + 1):
        if string[i:i + substring_size] == substring:
            count += 1
    return count
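As for the IndexError in the question: the loop lets x run up to len(string) - 1, so string[x+1] and string[x+2] read past the end of the string on the last iterations. The code also assumes the substring is exactly three characters long. A minimal sketch of a fix that keeps the question's function name and compares slices instead:

```python
def count_substring(string, sub_string):
    counter = 0
    width = len(sub_string)
    # Stop at the last index where a full-width window still fits
    for x in range(len(string) - width + 1):
        # Slicing never raises IndexError, unlike string[x + 1]
        if string[x:x + width] == sub_string:
            counter += 1
    return counter

print(count_substring('bobob', 'bob'))  # 2
```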

String has built-in method count for this purpose.
string = 'This is the way to do it.'
string.count('is')
Output: 2

Is there an easy way to get the number of repeating character in a word?

I'm trying to count how many times characters repeat in a word. The repetitions must be sequential.
For example, the method with input "loooooveee" should return 6 (4 times 'o', 2 times 'e').
I'm trying to implement string level functions and I can do it this way but, is there an easy way to do this? Regex, or some other sort of things?

Original question: order of repetition does not matter
You can subtract the number of unique letters from the total number of letters. set applied to a string will return a unique collection of letters.
x = "loooooveee"
res = len(x) - len(set(x)) # 6
Or you can use collections.Counter, subtract 1 from each value, then sum:
from collections import Counter
c = Counter("loooooveee")
res = sum(i-1 for i in c.values()) # 6
New question: repetitions must be sequential
You can use itertools.groupby to group sequential identical characters:
from itertools import groupby
g = groupby("aooooaooaoo")
res = sum(sum(1 for _ in j) - 1 for i, j in g) # 5
To avoid the nested sum calls, you can use itertools.islice:
from itertools import groupby, islice
g = groupby("aooooaooaoo")
res = sum(1 for _, j in g for _ in islice(j, 1, None)) # 5

You could use a regular expression if you want:
import re
rx = re.compile(r'(\w)\1+')
repeating = sum(x[1] - x[0] - 1
                for m in rx.finditer("loooooveee")
                for x in [m.span()])
print(repeating)
This correctly yields 6 and makes use of the .span() function.
The expression is
(\w)\1+
which captures a word character (a-z, A-Z, 0-9 and _, plus other Unicode word characters when matching Python 3 str patterns) and tries to repeat it as often as possible.
See a demo on regex101.com for the repeating pattern.
If you want to match any character (that is, not only word characters), change your expression to:
(.)\1+
See another demo on regex101.com.
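The regex variant can be packaged as a function. This is a sketch with a hypothetical name, count_sequential_repeats; it assumes re.DOTALL is wanted so that newlines can form runs too, since '.' does not match a newline by default.

```python
import re

def count_sequential_repeats(word):
    """Sum, over each run of identical characters, the run length minus one."""
    # re.DOTALL lets '.' match newlines too, so any character can form a run
    runs = re.finditer(r'(.)\1+', word, flags=re.DOTALL)
    return sum(m.end() - m.start() - 1 for m in runs)

print(count_sequential_repeats("loooooveee"))   # 6
print(count_sequential_repeats("aooooaooaoo"))  # 5
```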

Try this:
word = input('something:')
total = 0
chars = set(word)  # get the set of unique characters
for item in chars:  # iterate over the set and output the count for each item
    if word.count(item) > 1:
        total += word.count(item)
        print('{}|{}'.format(item, word.count(item)))
print('Total:' + str(total))
EDIT:
added total count of repetitions

Since it doesn't matter where the repetition is occurring or which characters are being repeated, you can make use of the set data structure provided in Python. It will discard the duplicate occurrences of any character or an object.
Therefore, the solution would look something like this:
def measure_normalized_emphasis(text):
    return len(text) - len(set(text))
This will give you the exact result.
Also make sure to check edge cases, which is good practice anyway.
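One edge case worth checking is a character that repeats without being adjacent to itself. The sketch below compares the set-based count with a run-based count; both function names are hypothetical.

```python
from itertools import groupby

def repeats_anywhere(text):
    # Counts every extra occurrence of a character, wherever it appears
    return len(text) - len(set(text))

def repeats_sequential(text):
    # Counts only extra characters inside a run of identical neighbours
    return sum(len(list(run)) - 1 for _, run in groupby(text))

for sample in ("loooooveee", "abab"):
    print(sample, repeats_anywhere(sample), repeats_sequential(sample))
# loooooveee 6 6
# abab 2 0
```

The two agree only when every repeated character sits in a single run, so the set approach fits the original question but not the sequential one.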

I think your code is comparing the wrong things
You start by finding the last character:
char = text[-1]
Then you compare this to itself:
for i in range(1, len(text)):
    if text[-i] == char: #<-- surely this is text[-1] to begin with?
Why not just run through the characters:
def measure_normalized_emphasis(text):
    if not text:
        return 0  # nothing to compare in an empty string
    char = text[0]
    emphasis_size = 0
    for i in range(1, len(text)):
        if text[i] == char:
            emphasis_size += 1
        else:
            char = text[i]
    return emphasis_size
This seems to work.

Backward search implementation python

I am working on some string search tasks to learn more efficient ways of searching.
I am trying to count how many times a substring occurs in a given string by using backward search.
For example, given the following strings:
original = 'panamabananas$'
s = 'smnpbnnaaaaa$a'
s1 = '$aaaaaabmnnnps' #sorted version of s
I am trying to find how many times the substring 'ban' occurs. To do so, I was thinking of iterating through both strings with the zip function. In the backward search, I should first look for the last character of ban (n) in s1 and see where it lines up with the next character, a, in s. It matches at indexes 9, 10 and 11, which are actually the third, fourth and fifth a in s. The next character to look for is b, but only for the matches that occurred before (that is, where n in s1 matched a in s). So we take those a (third, fourth and fifth) from s and see whether any of the third, fourth or fifth a in s1 lines up with a b in s. This way we would have found an occurrence of 'ban'.
Iterating and saving quasi-occurrences seems complex to me, so what I was trying is something like this:
n = 0 #counter of occurences
for i, j in zip(s1, s):
    if i == 'n' and j == 'a': # this should save the match
        if i[3:6] == 'a' and any(j[3:6] == 'b'):
            n += 1
I think nested if statements may be needed, but I am still a beginner. I am getting 0 occurrences when there is one occurrence of 'ban' in the original.

You can run a loop with find to count the number of occurrences of the substring.
s = 'panamabananasbananasba'
ss = 'ban'
count = 0
idx = s.find(ss, 0)
while idx != -1:
    count += 1
    idx += len(ss)
    idx = s.find(ss, idx)
print(count)
If you really want backward search, then reverse the string and substring and do the same mechanism.
s = 'panamabananasbananasban'
s = s[::-1]
ss = 'ban'
ss = ss[::-1]
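The backward search described in the question is usually formulated on the Burrows-Wheeler transform, with s as the transformed string and s1 as its sorted first column. The sketch below is one possible way to write it, assuming s really is the BWT of the original text. The name bwt_count is hypothetical, and the occurrence counts are computed naively with str.count, whereas a real index would precompute them.

```python
from collections import Counter

def bwt_count(bwt, pattern):
    """Count occurrences of pattern in the original text, given its BWT."""
    # first_pos[c] = row where character c starts in the sorted first column
    counts = Counter(bwt)
    first_pos = {}
    total = 0
    for ch in sorted(counts):
        first_pos[ch] = total
        total += counts[ch]

    def occ(ch, i):
        # Number of occurrences of ch in bwt[:i] (naive, linear time)
        return bwt[:i].count(ch)

    top, bottom = 0, len(bwt)  # half-open range of candidate rows
    for ch in reversed(pattern):  # process the pattern from its last character
        if ch not in first_pos:
            return 0
        top = first_pos[ch] + occ(ch, top)
        bottom = first_pos[ch] + occ(ch, bottom)
        if top >= bottom:
            return 0  # no row is still consistent with the pattern suffix
    return bottom - top

s = 'smnpbnnaaaaa$a'  # BWT of 'panamabananas$' from the question
print(bwt_count(s, 'ban'))  # 1
print(bwt_count(s, 'ana'))  # 3
```

For 'ban', the range narrows from rows 9 to 11 (the three n rows of s1) to rows 3 to 5 (the third to fifth a) and finally to the single b row, which matches the walk-through in the question.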

Count spaces in text (treat consecutive spaces as one)

How would you count the number of spaces or newline characters in a text in such a way that consecutive spaces are counted only as one?
For example, this is very close to what I want:
string = "This is an example text.\n But would be good if it worked."
counter = 0
for i in string:
    if i == ' ' or i == '\n':
        counter += 1
print(counter)
However, instead of returning 15, the result should be only 11.

The default str.split() function will treat consecutive runs of spaces as one. So simply split the string, get the size of the resulting list, and subtract one.
len(string.split())-1
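One caveat with the split approach: str.split() with no arguments discards leading and trailing whitespace, so runs at either end of the text are not counted, and it splits on any whitespace (tabs included), not only spaces and newlines. A sketch that adds the end runs back, using a hypothetical name, count_whitespace_runs:

```python
def count_whitespace_runs(text):
    """Count runs of whitespace, including leading and trailing runs."""
    words = text.split()
    if not words:
        # Only whitespace (one run) or an empty string (no runs)
        return 1 if text else 0
    runs = len(words) - 1        # runs between words
    runs += text[0].isspace()    # a leading run
    runs += text[-1].isspace()   # a trailing run
    return runs

print(count_whitespace_runs("a  b\n c"))  # 2
print(count_whitespace_runs(" a b "))     # 3
```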

Assuming you are permitted to use Python regex:
import re
print(len(re.findall(r"[ \n]+", string)))
Quick and easy!
UPDATE: Additionally, use [\s] instead of [ \n] to match any whitespace character.

You can do this:
string = "This is an example text.\n But would be good if it worked."
counter = 0
# A boolean flag indicating whether the previous character was a space
previous = False
for i in string:
    if i == ' ' or i == '\n':
        # The current character is a space
        previous = True # Setup for the next iteration
    else:
        # The current character is not a space, check if the previous one was
        if previous:
            counter += 1
        previous = False
print(counter)

re to the rescue.
>>> import re
>>> string = "This is an example text.\n But would be good if it worked."
>>> spaces = sum(1 for match in re.finditer(r'\s+', string))
>>> spaces
11
This consumes minimal memory. An alternative solution that builds a temporary list would be
>>> len(re.findall(r'\s+', string))
11
If you only want to consider space characters and newline characters (as opposed to tabs, for example), use the regex '(\n| )+' instead of '\s+'.

Just store a character that was the last character found. Set it to i each time you loop. Then within your inner if, do not increase the counter if the last character found was also a whitespace character.
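The idea above can be sketched as follows. The names count_space_runs and SPACES are hypothetical. A tuple is used for the membership test on the assumption that only spaces and newlines should count.

```python
SPACES = (' ', '\n')

def count_space_runs(text):
    counter = 0
    last = None  # no previous character before the loop starts
    for ch in text:
        # Count a space only when it starts a new run
        if ch in SPACES and last not in SPACES:
            counter += 1
        last = ch  # remember this character for the next iteration
    return counter

print(count_space_runs("one  two\n three"))  # 2
```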

You can iterate over numbers and use them as indexes.
counter = 0
for i in range(1, len(string)):
    if string[i] in ' \n' and string[i-1] not in ' \n':
        counter += 1
if string[0] in ' \n':
    counter += 1
print(counter)
Pay attention to the first character: this construction starts from the second character because string[i-1] at i = 0 would wrap around to the last character.

You can use enumerate, checking the next char is not also whitespace so consecutive whitespace will only count as 1:
string = "This is an example text.\n But would be good if it worked."
print(sum(ch.isspace() and not string[i:i+1].isspace() for i, ch in enumerate(string, 1)))
You can also use iter with a generator function, keeping track of the last character and comparing:
def con(s):
    it = iter(s)
    prev = next(it)
    for ele in it:
        yield prev.isspace() and not ele.isspace()
        prev = ele
    yield ele.isspace()
print(sum(con(string)))
An itertools version:
string = "This is an example text.\n But would be good if it worked. "
from itertools import tee, zip_longest  # izip_longest in Python 2
a, b = tee(string)
next(b)
print(sum(a.isspace() and not b.isspace() for a,b in zip_longest(a,b, fillvalue="") ))

Try:
def word_count(my_string):
    word_count = 1
    for i in range(1, len(my_string)):
        if my_string[i] == " ":
            if not my_string[i - 1] == " ":
                word_count += 1
    return word_count

You can use the function groupby() to find groups of consecutive spaces:
from collections import Counter
from itertools import groupby
s = 'This is an example text.\n But would be good if it worked.'
c = Counter(k for k, _ in groupby(s, key=lambda x: ' ' if x == '\n' else x))
print(c[' '])
# 11
