Posting this question after looking around various blogs and SO questions and still not finding an answer. I'm trying to wrap my head around a solution/algorithm from a LeetCode contest.
Here is the question:
Given two strings A and B, find the minimum number of times A has to be repeated such that B is a substring of the result. If no such solution exists, return -1.
For example, with A = "abcd" and B = "cdabcdab".
Return 3, because by repeating A three times (“abcdabcdabcd”), B is a substring of it; and B is not a substring of A repeated two times ("abcdabcd").
I know that the rolling hash approach is the preferred way to go, but I decided to start with the Boyer-Moore approach. After doing some research I learned that the following code uses the Boyer-Moore algorithm behind the scenes:
def repeatedStringMatch(self, A, B):
    """
    :type A: str
    :type B: str
    :rtype: int
    """
    m = math.ceil(len(B) / len(A))
    if B in A * m:
        return m
    elif B in A * (m + 1):
        return m + 1
    else:
        return -1
Based on my understanding of the algorithm, I'm not convinced the above solution uses the BM approach. I'm specifically not clear on what this line

m = math.ceil(len(B)/len(A))

does, and why we have to compute m in this fashion. Could anyone help me here?
Thanks in advance
The smaller string must be repeated at least m times for the larger string to be contained within it. The minimum value m can take is the smallest integer greater than or equal to the ratio of the two lengths (larger / smaller), because to contain a substring of length l, a string must have length at least l.
Using the example you shared:
A = "abcd"
B = "cdabcdab"
m0 = len(B) / len(A)
# m0 == 2.0
m = math.ceil(m0)
# m = 2
However, "cdabcdab" is not contained in "abcdabcd". But if we repeat "abcd" one more time, we find that "cdabcdab" is then a substring. Repeating "abcd" beyond that cannot change anything: B's starting position can always be taken to lie within the first copy of A, so m + 1 repetitions already cover every possible alignment. It is therefore only necessary to check containment for m and m + 1 repetitions.
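As a standalone sketch of the same logic (the code from the question minus the class wrapper; repeated_string_match is a renamed stand-in):

```python
import math

def repeated_string_match(A, B):
    # fewest repetitions of A that could possibly be long enough to contain B
    m = math.ceil(len(B) / len(A))
    # B can only start inside the first copy of A, so m + 1 repetitions
    # cover every possible alignment; beyond that nothing changes
    for k in (m, m + 1):
        if B in A * k:
            return k
    return -1

print(repeated_string_match("abcd", "cdabcdab"))  # -> 3
```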
The Python code you shared does not implement any search algorithm itself; it just uses whatever search algorithm backs the in operator. The specific algorithm is implementation dependent, but there is a high likelihood of it being Boyer-Moore or a variation of it, as it is popular and efficient.
edit:
the algorithm behind B in A in CPython appears to be a Boyer-Moore-Horspool variant (CPython's internal "fastsearch")
I have been trying to figure out how to code the following program using an NFA. I have been at it for a week without making progress on a correct implementation; I currently have an inefficient brute-force solution that takes a very long time, and I was hoping for some guidance here on implementing it with an NFA.
The problem, as posed for an NFA, goes as follows: given an integer N and a subset D ⊆ {0, 1, …, 9} of digits, decide whether N can be written as a sum of two integers X and Y that use only digits in D. For example, consider N = 130500633 and D = {1, 2, 8}. Since 130500633 = 128218812 + 2281821, the answer is yes. For a given D, define L(D) as the set of integers N that can be written as X + Y where X and Y use only digits in D. Build an NFA M for L(D), and implement a membership algorithm to test whether N is in L(M). Assume the input can be up to 100 digits long.
Given a minimum length n and a string S of 1's and 0's (e.g. "01000100"), I am trying to return the number of non-overlapping occurrences of a substring of length n containing all '0's. For example, given n = 2 and the string "01000100", the number of non-overlapping "00"s is 2.
This is what I have done:
def myfunc(S, N):
    return S.count('0'*N)
My question: is there a faster way of doing this for very long strings? This is from an online coding practice site, and my code passes all but one of the test cases, which fails because it cannot finish within the time limit. From my research, it seems count() is the fastest method for this.
This might be faster:
>>> s = "01000100"
>>> def my_count( a, n ) :
... parts = a.split('1')
... return sum( len(p)//n for p in parts )
...
>>> my_count(s, 2)
2
>>>
Worst case scenario for count() is O(N^2); the function above is strictly linear, O(N). Here's the discussion where the O(N^2) figure comes from: What's the computational cost of count operation on strings Python?
Also, you can always do this manually, without using split(): just loop over the string, keeping a counter of the current run of zeros -- on a '1', add counter // n to the total and reset the counter; on a '0', increment it. This is strictly O(N), so it would beat any other approach.
Finally, for relatively large values of n (n > 10?), there might be a sub-linear (or still linear, but with a smaller constant) algorithm, which starts by comparing a[n-1] to '0', then scans back toward the beginning. Chances are there is going to be a '1' somewhere, so we don't have to analyse the beginning of the window at all if a[n-1] is '1' -- simply because there's no way to fit enough zeros in there. Assuming we found a '1' at position k, the next position to compare would be a[k+n-1], again scanning back toward the start.
This way we can effectively skip most of the string during the search.
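A minimal sketch of the manual single-pass approach described above, assuming the same inputs:

```python
def count_zero_runs(s, n):
    # one pass: track the length of the current run of '0's;
    # a run of length L contains L // n non-overlapping blocks of n zeros
    total = run = 0
    for ch in s:
        if ch == '0':
            run += 1
        else:
            total += run // n
            run = 0
    return total + run // n  # flush the final run

print(count_zero_runs("01000100", 2))  # -> 2
```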
lenik posted a very good response that worked well. I also found another method faster than count() that I will post here as well. It uses the findall() function from the re module:
import re

def my_count(a, n):
    return len(re.findall('0'*n, a))
Let's assume we have an alphabet of 20 letters. Also let's assume that we have the following substring CCAY. I would like to calculate the number of the words which have length N letters and include the specific substring.
To be more precise, if the N = 6 I would like the following combinations CCAYxx, xCCAYx, xxCCAY where x is any letter of the alphabet. If N = 7 the combinations adjust as follows CCAYxxx, xCCAYxx, xxCCAYx, xxxCCAY and so on.
Also, I can think of a pitfall when the substring consists of only one letter of the alphabet, e.g. CCCC, in which case (for N = 6) the string CCCCCC should not be counted multiple times.
I would appreciate any help or guidance on how to approach this problem. Any sample code in python would be also highly appreciated.
You said brute force is okay, so here we go:
import itertools

alphabet = 'abc'
substring = 'ccc'
n = 7

res = set()
for combination in itertools.product(alphabet, repeat=n-len(substring)):
    # take the cartesian product of the alphabet so that we end up
    # with a total length of 'n' for the final combination
    for idx in range(len(combination)+1):
        res.add(''.join((*combination[:idx], substring, *combination[idx:])))
print(len(res))
Prints:
295
For a substring with no repetitions, like abc, I get 396 as the result, so I assume it covers the corner case appropriately.
That this is inefficient enough to make mathematicians weep goes without saying, but as long as your problems are small in length it should get the job done.
Analytical approach
There are k^m distinct ordered strings of length m over k = len(alphabet) symbols. Here the free part has length m = n - len(substring), and the 'substring' can be inserted at any of m + 1 positions, which leads to a total maximum of (m + 1) * k^m. The latter only holds if no two insertions ever produce an identical final combination, which is what makes this problem hard to compute analytically. So, the vague answer is: your result will be somewhere between k^m and (m + 1) * k^m.
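Those bounds can be sanity-checked against the brute force: with m = n - len(substring) free positions and k alphabet letters, the count must lie between k^m and (m + 1) * k^m. A small sketch, wrapping the same set-based enumeration in a hypothetical count_words helper:

```python
import itertools

def count_words(alphabet, substring, n):
    # brute force as above: insert the substring at every position of
    # every product of the remaining n - len(substring) letters,
    # deduplicating via a set
    m = n - len(substring)
    res = set()
    for comb in itertools.product(alphabet, repeat=m):
        for idx in range(m + 1):
            res.add(''.join(comb[:idx]) + substring + ''.join(comb[idx:]))
    return len(res)

k, m = 3, 7 - 3
exact = count_words('abc', 'ccc', 7)
print(exact)  # -> 295, matching the brute-force result above
assert k**m <= exact <= (m + 1) * k**m
```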
If you want to count the number of identical final combinations that include the substring, you can do so by counting the number of repetitions of the substring within a preliminary product:
n = 6
pre_prod = 'abab'
sub = 'ab'
pre_prods = ['ababab', 'aabbab', 'ababab', 'abaabb', 'ababab']
prods = ['ababab', 'aabbab', 'abaabb']
# len(pre_prods) - pre_prod.count(sub) -> len(prods), i.e. 5 - 2 = 3
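The example can be verified in a couple of lines, building pre_prods by inserting sub at every position of pre_prod:

```python
pre_prod = 'abab'
sub = 'ab'

# insert sub at every possible position of the preliminary product
pre_prods = [pre_prod[:i] + sub + pre_prod[i:]
             for i in range(len(pre_prod) + 1)]
print(pre_prods)  # -> ['ababab', 'aabbab', 'ababab', 'abaabb', 'ababab']

# number of duplicates == number of occurrences of sub inside pre_prod
assert len(pre_prods) - pre_prod.count(sub) == len(set(pre_prods))  # 5 - 2 == 3
```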
I will see if I can find a formula for that sometime soon.
So recently I hit upon this programming problem whose complexity I can't seem to reduce (my current code runs in O(n^2) time).
Essentially, I have four different lists (I'm using python btw) of integers, both positive and negative, say lists A, B, C, D. Now, each of these lists has 1000 integers, and these integers range from -25000 to 25000 inclusive. Now, suppose from each of these lists we choose an integer, say a, b, c, d. I would like the quickest way to find these a, b, c, d such that a+b=-(c+d).
Currently, my method relies on iterating through every possible combination of a, b and of c, d, then checking whether any element of the set {a+b} exists in the set {-(c+d)}. This, of course, is impractical, since it runs in O(n^2) time, even more so given the large list sizes (1000).
Hence I was wondering if anyone could think of a more efficient way (preferably O(n log n) or smaller), coded in python if possible.
Apologies if it's rather confusing. If you have any questions please inform me, I'll try to provide more clarification.
EDIT:
This problem is part of a larger problem. The larger problem states that if we have 4 sequences of numbers with at most 1000 integers in each, say A, B, C, D, find an a, b, c, d such that a+b+c+d=0.
I asked the above question since a+b+c+d=0 implies that a+b=-(c+d), which I thought would lead to the fastest way to solve the problem. If anyone can think of an even faster way, please do share it with me.
Thanks in advance! :)
Your problem isn't that combining pairs of elements is O(n^2), but rather that you're combining two such processes naively, ending up with an O(n^4) algorithm. I'm going to assume you just need to find at least one way to add up to 0 -- the method given below can easily be extended to find all solutions, if required.
Given that you have a relatively narrow range of accepted values (-25k to +25k, let's call those MIN and MAX respectively), here's what you do:
Create two int arrays, indicesA and indicesB, of size 2*(MAX - MIN) + 1, enough to cover every possible pairwise sum (which ranges from 2*MIN to 2*MAX). That's well under 1 MB of memory altogether, so nothing to worry about on a modern system.
Now loop over all elements of lists A and B, just like you were doing. Something like this pseudo-code (not too familiar with python so I'm not sure if it's valid as-is):

for idxA, valA in enumerate(A):
    for idxB, valB in enumerate(B):
        indicesA[valA + valB - 2*MIN] = idxA + 1
        indicesB[valA + valB - 2*MIN] = idxB + 1

Now just use this as an O(1) lookup table when looping over C and D:

for valC in C:
    for valD in D:
        neededVal = -(valC + valD) - 2*MIN
        # with MIN = -MAX, neededVal always lands inside the table
        if indicesA[neededVal] > 0:
            print('Found solution: {0} {1} {2} {3}'.format(A[indicesA[neededVal] - 1],
                  B[indicesB[neededVal] - 1], valC, valD))

Lookup-table initialization with 0s: O(MAX - MIN) (~100k here, smaller than n^2 in this case)
Filling the lookup table by looping over A and B: O(n^2)
Looping over C and D and checking for solutions: O(n^2)
Overall, O(n^2 + (MAX - MIN)) =~ O(n^2) with the values given. Probably can't do much better than that.
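In Python, the same two-pass idea is perhaps easiest with a dict keyed by pairwise sums instead of a flat offset array. A sketch (find_zero_sum is a hypothetical name, not from the question):

```python
def find_zero_sum(A, B, C, D):
    # first pass: remember one witness pair (a, b) for every reachable sum a + b
    sums_ab = {}
    for a in A:
        for b in B:
            sums_ab[a + b] = (a, b)
    # second pass: for each (c, d), look up the complementary sum -(c + d)
    for c in C:
        for d in D:
            pair = sums_ab.get(-(c + d))
            if pair is not None:
                return (pair[0], pair[1], c, d)
    return None  # no quadruple sums to zero

print(find_zero_sum([1, 2], [3, 4], [-5, -6], [0, 7]))  # -> (2, 3, -5, 0)
```

Both passes are O(n^2), matching the analysis above.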
You have four arrays and you want to choose one number from each array, such that the sum of the four numbers is zero. The technical name for this problem is 4SUM×4.
This problem is at least as hard as 3SUM×3, where three numbers summing to zero must be chosen from three arrays. An instance of 3SUM×3 can be converted to an instance of 4SUM×4 by simply adding an array of zeros, so any algorithm solving your problem can be used to solve 3SUM×3 in the same time complexity.
It doesn't appear to be known for certain that 3SUM×3 isn't easier than the more famous 3SUM problem, but it seems very likely to be equally difficult. A 3SUM×3 algorithm can be used to solve the 3SUM problem, or to determine with arbitrarily high probability that no solution exists. (The only issue with reducing 3SUM to 3SUM×3 is that 3SUM×3 allows solutions like 1, 1, -2, whereas 3SUM doesn't.) The theoretically best known algorithms for 3SUM only beat O(n^2) by factors of (log n) to some power.
Given all of that, it seems very unlikely that your problem can be solved in significantly less than O(n^2) time, asymptotically.
EDIT: See Solving "Who owns the Zebra" programmatically? for a similar class of problem
There's a category of logic problem on the LSAT that goes like this:
Seven consecutive time slots for a broadcast, numbered in chronological order 1 through 7, will be filled by six song tapes (G, H, L, O, P, S) and exactly one news tape. Each tape is to be assigned to a different time slot, and no tape is longer than any other tape. The broadcast is subject to the following restrictions:
L must be played immediately before O.
The news tape must be played at some time after L.
There must be exactly two time slots between G and P, regardless of whether G comes before or after P.
I'm interested in generating a list of permutations that satisfy the conditions, as a way of studying for the test and as a programming challenge. However, I'm not sure what class of permutation problem this is. I've generalized the type of problem as follows:
Given an n-length array A:
How many ways can a set of n unique items be arranged within A? Eg. How many ways are there to rearrange ABCDEFG?
If the length of the set of unique items is less than the length of A, how many ways can the set be arranged within A if items in the set may occur more than once? Eg. ABCDEF => AABCDEF; ABBCDEF, etc.
How many ways can a set of unique items be arranged within A if the items of the set are subject to "blocking conditions"?
My thought is to encode the restrictions and then use something like Python's itertools to generate the permutations. Thoughts and suggestions are welcome.
This is easy to solve (a few lines of code) as an integer program. Using a tool like the GNU Linear Programming Kit, you specify your constraints in a declarative manner and let the solver come up with the best solution. Here's an example of a GLPK program.
You could code this using a general-purpose programming language like Python, but this is the type of thing you'll see in the first few chapters of an integer programming textbook. The most efficient algorithms have already been worked out by others.
EDIT: to answer Merjit's question:
Define:
a matrix Y where Y_(ij) = 1 if tape i is played before tape j, and 0 otherwise.
a vector C, where C_i indicates the time slot when tape i is played (e.g. 1, 2, 3, 4, 5, 6, 7).
a large constant M (look up the term "big M" in an optimization textbook).
Minimize the sum of the vector C subject to the following constraints:
Y_(ij) + Y_(ji) = 1 // if i is before j, then j must not be before i (Y is binary, so this stays a linear constraint)
C_j < C_k + M*Y_(kj) // the time slot of j is greater than the time slot of k only if Y_(kj) = 1
C_O - C_L = 1 // L must be played immediately before O
C_N > C_L // news tape must be played at some time after L
|C_G - C_P| = 2 // You will need to manipulate this a bit to make it a linear constraint
That should get you most of the way there. You want to write up the above constraints in the MathProg language's syntax (as shown in the links), and make sure I haven't left out any constraints. Then run the GLPK solver on the constraints and see what it comes up with.
Okay, so the way I see it, there are two ways to approach this problem:
Go about writing a program that will approach this problem head first. This is going to be difficult.
But combinatorics teaches us that an easier way is to generate all permutations and then filter out the ones that don't satisfy your constraints.
I would go with number 2.
You can find all permutations of a given string or list by using this algorithm. Using this algorithm, you can get a list of all permutations. You can now apply a number of filters on this list by checking for the various constraints of the problem.
def L_before_O(s):
    return (s.index('O') - s.index('L') == 1)

def N_after_L(s):
    return (s.index('L') < s.index('N'))

def G_and_P(s):
    return (abs(s.index('G') - s.index('P')) == 2)

def all_perms(s): # this is from the link
    if len(s) <= 1:
        yield s
    else:
        for perm in all_perms(s[1:]):
            for i in range(len(perm)+1):
                yield perm[:i] + s[0:1] + perm[i:]

def get_the_answer():
    permutations = [i for i in all_perms('GHLOPSN')] # N is the news tape
    a = [i for i in permutations if L_before_O(i)]
    b = [i for i in a if N_after_L(i)]
    c = [i for i in b if G_and_P(i)]
    return c
I haven't tested this, but this is the general idea of how I would go about coding such a question.
Hope this helps
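For comparison, a compact sketch using itertools.permutations instead of a hand-rolled generator (lsat_schedules is a hypothetical name; same three constraints, with "exactly two time slots between G and P" read as slot positions differing by 2):

```python
from itertools import permutations

def lsat_schedules():
    # N stands in for the news tape, as in the answer above
    return [p for p in permutations('GHLOPSN')
            if p.index('O') - p.index('L') == 1         # L immediately before O
            and p.index('N') > p.index('L')             # news tape after L
            and abs(p.index('G') - p.index('P')) == 2]  # G and P two slots apart

schedules = lsat_schedules()
print(len(schedules))
```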