Average time a substring occurs

Average time a substring occurs - python

I have a programm that returns too many results, so i want to take only the useful results that are above the average. My question is in a string length N that is produced from alphabet of k letters, how many times in average all substrings length m occurs? For example in the string "abcbbbbcbabcabcbcab" of alphabet {a,b,c} how many times in average all the substrings of length 3 occurs, abc occurs 3 times, bbb occurs 2 times (i count it even if they overlap), and so on. Or is there a way to know it from python (where my code is) before executing the programm?

Do you want to count the substrings in a specific string, or do you want the theoretical average in the general case? The probability that a string with length m of an alphabet with k characters occurs at any given position is 1/(k^m), so if your string is N characters long, that would make an expected number of occurrences of(N-m+1)/(k^m) (-m+1 because the string can not appear in the last m-1 positions). Another way to see this is as the number of substrings of length m (N-m+1) divided by the number of different such substrings (k^m).
You can calculate the average counts for your example, to see whether the formula gets to about the right result. Of course, one should not expect too much, as it's a very small sample size...
>>> s = "abcbbbbcbabcabcbcab"
>>> N = len(s)
>>> k = 3
>>> m = 3
For this, the formula gives us
>>> (N-m+1)/(k**m)
0.6296296296296297
We can count the occurrences for all the three-letter strings using itertools.product and a count function (str.count will not count overlapping strings correctly):
>>> count = lambda x: sum(s[i:i+m] == x for i in range(len(s)))
>>> X = [''.join(cs) for cs in itertools.product("abc", repeat=3)]
>>> counts = [count(x) for x in X]
In this case, this gives you exactly the same result as the formula. (I'm just as surprised as you.)
>>> sum(counts)/len(counts)
0.6296296296296297

Related

Time Complexity for LeetCode 3. Longest Substring Without Repeating Characters

Problem: Given a string s, find the length of the longest substring
without repeating characters.
Example: Input: s = "abcabcbb" Output: 3 Explanation: The answer is
"abc", with the length of 3.
My solution:
class Solution:
def lengthOfLongestSubstring(self, s: str) -> int:
seen = set()
l = r = curr_len = max_len = 0
n = len(s)
while l < n:
if r < n and s[r] not in seen:
seen.add(s[r])
curr_len += 1
max_len = max(curr_len, max_len)
r += 1
else:
l += 1
r = l
curr_len = 0
seen.clear()
return max_len
I know this is not an efficient solution, but I am having trouble figuring out its time complexity.
I visit every character in the string but, for each one of them, the window expands until it finds a repeated char. So every char ends up being visited multiple times, but not sure if enough times to justify an O(n2) time complexity and, obviously, it's way worse than O(n).

You could claim the algorithm to be O(n) if you know the size of the character set your input can be composed of, because the length your window can expand is limited by the number of different characters you could pass over before encountering a duplicate, and this is capped by the size of the character set you're working with, which itself is some constant independent of the length of the string. For example, if you are only working with lower case alphabetic characters, the algorithm is O(26n) = O(n).
To be more exact you could say that it runs in O(n*(min(m,n)) where n is the length of the string and m is the number of characters in the alphabet of the string. The reason for the min is that even if you're somehow working with an alphabet of unlimited unique characters, at worst you're doing a double for loop to the end of the string. That means however that if the number of possible characters you can encounter in the string exceeds the string's length you have a worst case O(n^2) performance (which occurs when every character of the string is unique).

How can I optimize this for-loop?

I need to check the occurrences of the letter "a" in a string s of size n.
Example:
s = "abcac"
n = 10
String to check for occurrences of letter "a": "abcacabcac".
Occurrences: 4
My code works, but I need it to work faster for larger values of n.
What can I do to optimize this code?
def repeatedString(s, n):
a_count, word_iter = 0, 0
for i in range(n):
if s[word_iter] == "a":
a_count+=1
word_iter += 1
if word_iter == (len(s)):
word_iter = 0
return a_count

You only don't need to assemble the full repeated string to do it. count the number of the specified characted in the whole string and multiple that by the number of times it will be fully repeated (n//len(s) times). Add to that the number of occurrences that will appear in the last (truncated) part at the end of the repetitions (i.e. first n%len(s) characters)
def countChar(s,n,c):
return s.count(c)*n//len(s)+s[:n%len(s)].count(c)
output:
countChar("abcac",10,"a") # 4 times in 'abcacabcac'
countChar("abcac",17,"a") # 7 times in 'abcacabcacabcacab'

Count the number of times a appears in a string, s up to length n
s = "abcac"
n = 10
str(s*(int(n/len(s))))[:n].count('a')

You can use regular expressions:
import re
a_count = len(re.findall(r'a',s))
re.findall returns an array of all matches, and we can just get the length of it. Using a regular expression allows for greater generalization and the ability to search for more complex patterns. Debra's original answer is better for a simple string search though:
a_count = s.count('a')

Count max consecutive RE groups in a string [duplicate]

This question already has answers here:
re.findall behaves weird
(3 answers)
Closed 3 years ago.
How can I count the max amount of consecutive string groups in a string?
import re
s = "HELLOasdHELLOasdHELLOHELLOHELLOasdHELLOHELLO"
# Give me the max amount of consecutive HELLO groups ---> wich is 3
# There's a group of 3 and a group of 2, but 3 is the max.
count = re.findall("(HELLO)+", s) # count is: ['HELLO', 'HELLO', 'HELLO', 'HELLO']
count = len(count)
print(count)
Output is:
4
Which is totally wrong. The max amount of consecutive HELLO is 3.
I think I'm using the wrong RE and I have no clue how to count those repetitions in order to find the max.
And I can't understand why the output is 4.
Thanks!

You need to capture the entire string of consecutive HELLOs in your match; then you can work out the number of HELLOs by dividing the length of the match string by 5 (the length of HELLO). Using a list comprehension:
import re
s = "HELLOasdHELLOasdHELLOHELLOHELLOasdHELLOHELLO"
print(max([len(x) // 5 for x in re.findall(r'((?:HELLO)+)', s)]))
Output
3

I think you should change to another solution that is easier to understand than looking for short code.
s = "HELLOasdHELLOasdHELLOHELLOHELLOasdHELLOHELLO"
word_search = "HELLO"
def find_char(str_var: str, word_search: str) -> int:
count = 0
for i in range(len(s)):
char = word_search * i
if str_var.find(char) != - 1:
count = i
return count
find = find_char(s)
print(find) # 3
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Update:
Actually it can work with one line of code without requiring additional modules:
c = max([i for i in range(len(s)) if s.find('HELLO' * i) != -1])
There are many cases where people use Regular Expressions to cool and reject simpler solutions to work.
Back to the topic of the topic owner. He doesn't understand why the output = 4.
Because re.findall () will return a list of elements that it finds, here, it finds 4 elements.
And len (list) will return the total number of elements in a list -> output = 4.
with the regex of the above answers, it also found 4 elements and when assembled into the top code, len(list) = 4.
We need to explain that len(list) is not right.
The problem is in the form of find max(x) if x * 'HELLO' in s.
Regex is a powerful and useful tool! but we don't always use it.

As explained in this question: Regex behaving weird when finding floating point strings
If one or more groups are present in the pattern, [re.findall will] return a list of groups
Therefore you want a non-capturing group instead. Let's work through with the sample string and pattern:
s = 'HELLOasdHELLOasdHELLOHELLOHELLOasdHELLOHELLO'
p = 'HELLO'
To find all occurrences of consecutive repetitions of the pattern, we just need to modify your regex slight to use a non-capturing group:
>>> matches = re.findall(f'(?:{p})+', s)
>>> matches
['HELLO', 'HELLO', 'HELLOHELLOHELLO', 'HELLOHELLO']
Now we just need to find the longest string and divide its length by the length of the pattern:
>> max(map(len, matches)) // len(p)
3

Word ranking partial completion [duplicate]

This question already has answers here:
Finding the ranking of a word (permutations) with duplicate letters
(6 answers)
Closed 8 years ago.
I am not sure how to solve this problem within the constraints.
Shortened problem formulation:
"Word" as any sequence of capital letters A-Z (not limited to just "dictionary words").
Consider list of permutations of all characters in a word, sorted lexicographically
Find a position of original word in such a list
Do not generate all possible permutations of a word, since it won't fit in time-memory constraints.
Constraints: word length <= 25 characters; memory limit 1Gb, any answer should fit in 64-bit integer
Original problem formulation:
Consider a "word" as any sequence of capital letters A-Z (not limited to just "dictionary words"). For any word with at least two different letters, there are other words composed of the same letters but in a different order (for instance, STATIONARILY/ANTIROYALIST, which happen to both be dictionary words; for our purposes "AAIILNORSTTY" is also a "word" composed of the same letters as these two). We can then assign a number to every word, based on where it falls in an alphabetically sorted list of all words made up of the same set of letters. One way to do this would be to generate the entire list of words and find the desired one, but this would be slow if the word is long. Write a program which takes a word as a command line argument and prints to standard output its number. Do not use the method above of generating the entire list. Your program should be able to accept any word 25 letters or less in length (possibly with some letters repeated), and should use no more than 1 GB of memory and take no more than 500 milliseconds to run. Any answer we check will fit in a 64-bit integer.
Sample words, with their rank:
ABAB = 2
AAAB = 1
BAAA = 4
QUESTION = 24572
BOOKKEEPER = 10743
examples:
AAAB - 1
AABA - 2
ABAA - 3
BAAA - 4
AABB - 1
ABAB - 2
ABBA - 3
BAAB - 4
BABA - 5
BBAA - 6
I came up with I think is only a partial solution.
Imagine I have the word JACBZPUC. I sort the word and get ABCCJPUZ This should be rank 1 in the word rank. From ABCCJPUZ to the first alphabetical word right before the word starting with J I want to find the number of permutations between the 2 words.
ex:
for `JACBZPUC`
sorted --> `ABCCJPUZ`
permutations that start with A -> 8!/2!
permutations that start with B -> 8!/2!
permutations that start with C -> 8!/2!
Add the 3 values -> 60480
The other C is disregarded as the permutations would have the same values as the previous C (duplicates)
At this point I have the ranks from ABCCJPUZ to the word right before the word that starts with J
ABCCJPUZ rank 1
...
... 60480 values
...
*HERE*
JABCCJPUZ rank 60481 LOCATION A
...
...
...
JACBZPUC rank ??? LOCATION B
I'm not sure how to get the values between Locations A and B:
Here is my code to find the 60480 values
def perm(word):
return len(set(itertools.permutations(word)))
def swap(word, i, j):
word = list(word)
word[i], word[j] = word[j], word[i]
print word
return ''.join(word)
def compute(word):
if ''.join(sorted(word)) == word:
return 1
total = 0
sortedWord = ''.join(sorted(word))
beforeFirstCharacterSet = set(sortedWord[:sortedWord.index(word[0])])
print beforeFirstCharacterSet
for i in beforeFirstCharacterSet:
total += perm(swap(sortedWord,0,sortedWord.index(i)))
return total
Here is a solution I found online to solve this problem.
Consider the n-letter word { x1, x2, ... , xn }. My solution is based on the idea that the word number will be the sum of two quantities:
The number of combinations starting with letters lower in the alphabet than x1, and
how far we are into the the arrangements that start with x1.
The trick is that the second quantity happens to be the word number of the word { x2, ... , xn }. This suggests a recursive implementation.
Getting the first quantity is a little complicated:
Let uniqLowers = { u1, u2, ... , um } = all the unique letters lower than x1
For each uj, count the number of permutations starting with uj.
Add all those up.
I think I complete step number 1 but not number 2. I am not sure how to complete this part
Here is the Haskell solution...I don't know Haskell =/ and I am trying to write this program in Python
https://github.com/david-crespo/WordNum/blob/master/comb.hs

The idea of finding the number of prmutations of the letters before the actual first letter is good.But your calculation:
for `JACBZPUC`
sorted --> `ABCCJPUZ`
permutations that start with A -> 8!/2!
permutations that start with B -> 8!/2!
permutations that start with C -> 8!/2!
Add the 3 values -> 60480
is wrong. There are only 8!/2! = 20160 permutations of JACBZPUC, so the starting position can't be greater than 60480. In your method, the first letter is fixed, you can only permute the seven following letters. So:
permutations that start with A: 7! / 2! == 2520
permutations that start with B: 7! / 2! == 2520
permutations that start with C: 7! / 1! == 5040
-----
10080
You don't divide by 2! to find the permutations beginning with C, because the seven remaning letters are unique; there's only one C left.
Here's a Python implementation:
def fact(n):
"""factorial of n, n!"""
f = 1
while n > 1:
f *= n
n -= 1
return f
def rrank(s):
"""Back-end to rank for 0-based rank of a list permutation"""
# trivial case
if len(s) < 2: return 0
order = s[:]
order.sort()
denom = 1
# account for multiple occurrences of letters
for i, c in enumerate(order):
n = 1
while i + n < len(order) and order[i + n] == c:
n += 1
denom *= n
# starting letters alphabetically before current letter
pos = order.index(s[0])
#recurse to list without its head
return fact(len(s) - 1) * pos / denom + rrank(s[1:])
def rank(s):
"""Determine 1-based rank of string permutation"""
return rrank(list(s)) + 1
strings = [
"ABC", "CBA",
"ABCD", "BADC", "DCBA", "DCAB", "FRED",
"QUESTION", "BOOKKEEPER", "JACBZPUC",
"AAAB", "AABA", "ABAA", "BAAA"
]
for s in strings:
print s, rank(s)

The second part of the solution you have found is also --I think-- what I was about to suggest:
To go from what you call "Location A" to "Location B", you have to find the position of word ACBZPUC among its possible permutations. Consider that a new question to your algorithm, with a new word that just happens to be one position shorter than the original one.

The words in the alphabetical list between JABCCPUZ, which you know the position of, and JACBZPUC, which you want to find the position of, all start with J. Finding the position of JACBZPUC relative to JABCCPUZ, then, is equivalent to finding the relative positions of those two words with the initial J removed, which is the same as the problem you were trying to solve initially but with a word one character shorter.
Repeat that process enough times and you will be left with a word that contains a single character, C. The position of a word with a single character is known to always be 1, so you can then sum that and all of the previous relative positions for an absolute position.

Python - build new string of specific length with n replacements from specific alphabet

I have been working on a fast, efficient way to solve the following problem, but as of yet, I have only been able to solve it using a rather slow, nest-loop solution. Anyways, here is the description:
So I have a string of length L, lets say 'BBBX'. I want to find all possible strings of length L, starting from 'BBBX', that differ at, at most, 2 positions and, at least, 0 positions. On top of that, when building the new strings, new characters must be selected from a specific alphabet.
I guess the size of the alphabet doesn't matter, so lets say in this case the alphabet is ['B', 'G', 'C', 'X'].
So, some sample output would be, 'BGBG', 'BGBC', 'BBGX', etc. For this example with a string of length 4 with up to 2 substitutions, my algorithm finds 67 possible new strings.
I have been trying to use itertools to solve this problem, but I am having a bit of difficulty finding a solution. I try to use itertools.combinations(range(4), 2) to find all the possible positions. I am then thinking of using product() from itertools to build all of the possibilities, but I am not sure if there is a way I could connect it somehow to the indices from the output of combinations().

Here's my solution.
The first for loop tells us how many replacements we will perform. (0, 1 or 2 - we go through each)
The second loop tells us which letters we will change (by their indexes).
The third loop goes through all of the possible letter changes for those indexes. There's some logic to make sure we actually change the letter (changing "C" to "C" doesn't count).
import itertools
def generate_replacements(lo, hi, alphabet, text):
for count in range(lo, hi + 1):
for indexes in itertools.combinations(range(len(text)), count):
for letters in itertools.product(alphabet, repeat=count):
new_text = list(text)
actual_count = 0
for index, letter in zip(indexes, letters):
if new_text[index] == letter:
continue
new_text[index] = letter
actual_count += 1
if actual_count == count:
yield ''.join(new_text)
for text in generate_replacements(0, 2, 'BGCX', 'BBBX'):
print text
Here's its output:
BBBX GBBX CBBX XBBX BGBX BCBX BXBX BBGX BBCX BBXX BBBB BBBG BBBC GGBX
GCBX GXBX CGBX CCBX CXBX XGBX XCBX XXBX GBGX GBCX GBXX CBGX CBCX CBXX
XBGX XBCX XBXX GBBB GBBG GBBC CBBB CBBG CBBC XBBB XBBG XBBC BGGX BGCX
BGXX BCGX BCCX BCXX BXGX BXCX BXXX BGBB BGBG BGBC BCBB BCBG BCBC BXBB
BXBG BXBC BBGB BBGG BBGC BBCB BBCG BBCC BBXB BBXG BBXC

Not tested much, but it does find 67 for the example you gave. The easy way to connect the indices to the products is via zip():
def sub(s, alphabet, minsubs, maxsubs):
from itertools import combinations, product
origs = list(s)
alphabet = set(alphabet)
for nsubs in range(minsubs, maxsubs + 1):
for ix in combinations(range(len(s)), nsubs):
prods = [alphabet - set(origs[i]) for i in ix]
s = origs[:]
for newchars in product(*prods):
for i, char in zip(ix, newchars):
s[i] = char
yield "".join(s)
count = 0
for s in sub('BBBX', 'BGCX', 0, 2):
count += 1
print s
print count
Note: the major difference from FogleBird's is that I posted first - LOL ;-) The algorithms are very similar. Mine constructs the inputs to product() so that no substitution of a letter for itself is ever attempted; FogleBird's allows "identity" substitutions, but counts how many valid substitutions are made and then throws the result away if any identity substitutions occurred. On longer words and a larger number of substitutions, that can be much slower (potentially the difference between len(alphabet)**nsubs and (len(alphabet)-1)**nsubs times around the ... in product(): loop).

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Average time a substring occurs - python

Related

Time Complexity for LeetCode 3. Longest Substring Without Repeating Characters

How can I optimize this for-loop?

Count max consecutive RE groups in a string [duplicate]

Word ranking partial completion [duplicate]

Python - build new string of specific length with n replacements from specific alphabet

Categories

Resources