PS: This is not a duplicate of How to find the overlap between 2 sequences, and return it
[although I would welcome answers that show how that approach could be applied to the problem below].
Q: Although my solution passes, it is not scalable and is definitely not optimized (it scores low). Please read the following description of the problem and kindly offer a better solution.
Question:
A border of a string S is any string that is both a prefix and a suffix of S. For simplicity, we require prefixes and suffixes to be non-empty and shorter than the whole string S. For example, "cut" is a border of the string "cutletcut", and the string "barbararhubarb" has two borders: "b" and "barb".
Write a function:
class Solution { public int solution(String S); }
that, given a string S consisting of N characters, returns the length of its longest border that has at least three non-overlapping occurrences in the given string. If there is no such border in S, the function should return 0.
For example,
if S = "barbararhubarb" the function should return 1, as explained above;
if S = "ababab" the function should return 2, as "ab" and "abab" are both borders of S, but only "ab" has three non-overlapping occurrences;
if S = "baaab" the function should return 0, as its only border "b" occurs only twice.
Assume that:
N is an integer within the range [0..1,000,000];
string S consists only of lower-case letters (a-z).
Complexity:
expected worst-case time complexity is O(N);
expected worst-case space complexity is O(N) (not counting the storage required for input arguments).
def solution(S):
    S = S.lower()
    presuf = []
    f = l = str()
    rank = []
    wordlen = len(S)
    for i, j in enumerate(S):
        y = -i - 1
        f += S[i]
        l = S[y] + l
        if f == l and f != S:
            # print(f, l)
            new = S[i+1:-i-1]
            mindex = new.find(f)
            if mindex != -1:
                mid = f  # new[mindex]
                # print(mid)
            else:
                mid = None
            presuf.append((f, mid, l, (i, y)))
    # print(presuf)
    for i, j, k, o in presuf:
        if o[0] < wordlen + o[-1]:  # non-overlapping
            if i == j:
                rank.append(len(i))
            else:
                rank.append(0)
    if len(rank) == 0:
        return 0
    else:
        return max(rank)
My solution's time complexity is O(N^2) or O(N^4).
Help greatly appreciated.
My solution is a combination of the Rabin-Karp and Knuth–Morris–Pratt algorithms.
http://codility.com/cert/view/certB6J4FV-W89WX4ZABTDRVAG6/details
I have a (Java) solution that performs in O(N) or O(N**3), for a resulting 90/100 overall, but I can't figure out how to make it get through 2 different test cases:
almost_all_same_letters
aaaaa...aa??aaaa??....aaaaaaa 2.150 s. TIMEOUT ERROR
running time: >2.15 sec., time limit: 1.20 sec.
same_letters_on_both_ends 2.120 s. TIMEOUT ERROR
running time: >2.12 sec., time limit: 1.24 sec.
Edit: Nailed it!
Now I have a solution that performs in O(N) and passes all the checks for a 100/100 result :)
I didn't know Codility, but it's a nice tool!
I have a solution with suffix arrays (there actually is an algorithm for constructing the SA and LCP in linear time, or something a bit worse than that, but surely not quadratic).
I am still not sure whether I can go without RMQs (O(log n) queries with a segment tree), which I couldn't get to pass my own cases and which seem quite complicated, but with RMQs it works (not to mention that the approach with a for loop instead of an RMQ would make it quadratic anyway).
The solution performs quite fast and passes my 21 test cases with the various quirks I've managed to craft, but it is still failing on some of their cases. Not sure if that helped you or gave you an idea of how to approach the problem, but I am sure that a naive solution, as @Vicenco said in some of his comments, can't get you better than Silver.
EDIT:
Managed to fix all the problems, but it is still too slow. I had to enforce some conditions, which increased the complexity, and I am still not sure how to optimize that. Will keep you posted. Good luck!
protected int calcBorder(String input) {
    if (null != input) {
        int mean = (input.length() / 3);
        while (mean >= 1) {
            if (input.substring(0, mean).equals(
                    input.substring(input.length() - mean))) {
                String reference = input.substring(0, mean);
                String temp = input.substring(mean, (input.length() - mean));
                int startIndex = 0;
                int endIndex = mean;
                int count = 2;
                while (endIndex <= temp.length()) {
                    if (reference.equals(temp.substring(startIndex, endIndex))) {
                        count++;
                        if (count >= 3) {
                            return reference.length();
                        }
                    }
                    startIndex++;
                    endIndex++;
                }
            }
            mean--;
        }
    }
    return 0;
}
The Z-Algorithm would be a good solution.
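A minimal sketch of that idea (my own code, not an official solution): every border of S shows up in the Z-array as a position i with Z[i] == len(S) - i, and non-overlapping occurrences of a length-b border can be counted greedily over positions with Z[i] >= b. The greedy scan repeats per candidate border, so this sketch is O(N^2) in the worst case (e.g. a string of all a's) rather than the required O(N), but it illustrates the approach:

def z_array(s):
    # Z[i] = length of the longest common prefix of s and s[i:]
    n = len(s)
    Z = [0] * n
    if n:
        Z[0] = n
    l = r = 0
    for i in range(1, n):
        if i < r:
            Z[i] = min(r - i, Z[i - l])
        while i + Z[i] < n and s[Z[i]] == s[i + Z[i]]:
            Z[i] += 1
        if i + Z[i] > r:
            l, r = i, i + Z[i]
    return Z

def solution(S):
    n = len(S)
    Z = z_array(S)
    # border lengths: the suffix starting at i equals the prefix of length n - i
    borders = [n - i for i in range(1, n) if Z[i] == n - i]
    for b in sorted(borders, reverse=True):  # try the longest border first
        count, next_free = 0, 0
        for i in range(n):  # greedy count of non-overlapping occurrences
            if Z[i] >= b and i >= next_free:
                count += 1
                next_free = i + b
                if count >= 3:
                    return b
    return 0

For example, solution("ababab") returns 2 and solution("baaab") returns 0, matching the examples in the problem statement.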
I am working on the following problem: checking whether one string contains a permutation of another.
I was inspired by the first answer to this question to come up with a solution that utilizes some of the properties of XOR (Identity, Commutative, and Self-Inverse) to work in O(n) time and O(1) space.
def checkInclusion(s1: str, s2: str) -> bool:
    # Checks for a permutation of s1 inside of s2.
    # XORs all of the characters in an s1-length window of s2.
    # If xor_product == 0 --> permutation identified.
    # Relies on properties of XOR to find the answer: identity, commutativity, and self-inverse.
    xor_product = 0
    for i in range(0, len(s2) - len(s1) + 1):
        s1_index = 0
        for j in range(i, i + len(s1)):
            xor_product = xor_product ^ ord(s1[s1_index]) ^ ord(s2[j])
            s1_index += 1
        if xor_product == 0:
            return True
        xor_product = 0
    return False
This solution works for most inputs, but fails when s1 = "kitten" and s2 = "sitting". Is this solution conceptually flawed? If so, then how? If not, then what's the bug?
I'm admittedly new to coding interview style questions. All help appreciated.
Yes, the XOR approach is flawed.
It is a kind of simple hash, but this hash can be identical for different strings (consider 6^7 == 1 and 3^2 == 1). When the XOR hashes coincide, you would need to verify a real match by other means, for example by directly comparing sorted copies of the string and the substring, but that is not appropriate for the contest case: special tests with many identical hashes will cause slow work, and the worst-case time is too large.
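The failing input from the question shows exactly such a coincidence: the window "itting" of "sitting" is not a permutation of "kitten", yet the XOR of all the characters cancels to zero:

from functools import reduce

# XOR together every character of "kitten" and of the window "itting"
x = reduce(lambda acc, ch: acc ^ ord(ch), "kitten" + "itting", 0)
print(x)  # 0, so the check wrongly reports a permutation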
Instead, you can exploit the dictionary/counter approach: update the counters for every item entering and leaving the sliding window, and check that all entries of the counter have the same counts as the sample.
P.S. Keeping a NumberOfGoodCounters value helps to avoid checking all the counters at every step.
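A minimal sketch of that counter idea (my own code; the good variable plays the role of the NumberOfGoodCounters value). Since the window size equals len(s1), the window is a permutation exactly when every distinct character of s1 has a matching count:

from collections import Counter

def check_inclusion(s1: str, s2: str) -> bool:
    m, n = len(s1), len(s2)
    if m > n:
        return False
    need = Counter(s1)
    window = Counter(s2[:m])
    # number of distinct characters of s1 whose window count already matches
    good = sum(1 for c in need if window[c] == need[c])
    for i in range(m, n):
        if good == len(need):
            return True
        # slide the window: s2[i] enters, s2[i - m] leaves
        for c, delta in ((s2[i], 1), (s2[i - m], -1)):
            if c in need and window[c] == need[c]:
                good -= 1
            window[c] += delta
            if c in need and window[c] == need[c]:
                good += 1
    return good == len(need)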
I could do this by brute force, but I was hoping there was clever coding, or perhaps an existing function, or something I am not realising...
So some examples of numbers I want:
00000000001111110000
11111100000000000000
01010101010100000000
10101010101000000000
00100100100100100100
The full permutation, except with results that have ONLY six 1's. Not more. Not less. 64 or 32 bits would be ideal; 16 bits if that provides an answer.
I think what you need here is the itertools module.
BAD SOLUTION
But you need to be careful; for instance, using something like permutations would only work for very small inputs.
Something like the below would give you a binary representation:
>>> ["".join(v) for v in set(itertools.permutations(["1"]*2+["0"]*3))]
['11000', '01001', '00101', '00011', '10010', '01100', '01010', '10001', '00110', '10100']
then just getting the decimal representation of those numbers:
>>> [int("".join(v), 2) for v in set(itertools.permutations(["1"]*2+["0"]*3))]
[24, 9, 5, 3, 18, 12, 10, 17, 6, 20]
if you wanted 32 bits with 6 ones and 26 zeroes, you'd use:
>>> [int("".join(v), 2) for v in set(itertools.permutations(["1"]*6+["0"]*26))]
but this computation would take a supercomputer to deal with (32! = 263130836933693530167218012160000000 permutations before deduplication).
DECENT SOLUTION
So a more clever way to do it is using combinations, maybe something like this:
import itertools

num_bits = 32
num_ones = 6

lst = [
    f"{sum([2**vv for vv in v]):b}".zfill(num_bits)
    for v in itertools.combinations(range(num_bits), num_ones)
]
print(len(lst))
this would tell us there are 906192 numbers with 6 ones in the whole spectrum of 32-bit numbers.
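As a sanity check (Python 3.8+), the count matches the binomial coefficient C(32, 6) directly:

import math
print(math.comb(32, 6))  # 906192 ways to place six ones among 32 bits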
CREDITS:
Credits for this answer go to @Mark Dickinson, who pointed out that using permutations was infeasible and suggested the usage of combinations.
Well, I am not a Python coder, so I cannot post valid code for you. Instead I can do a C++ version...
If you look at your problem, you set 6 bits among many zeros... so I would approach this with 6 nested for loops computing all the possible positions of the 1s and setting those bits...
Something like:
for (i0=   0;i0<32-5;i0++)
 for (i1=i0+1;i1<32-4;i1++)
  for (i2=i1+1;i2<32-3;i2++)
   for (i3=i2+1;i3<32-2;i3++)
    for (i4=i3+1;i4<32-1;i4++)
     for (i5=i4+1;i5<32-0;i5++)
      {
      // here i0,...,i5 mark the set bit positions
      }
So the O(2^32) drops to (32*31*30*29*28*27)/6! = 906192 iterations, and you cannot go faster than that, as fewer iterations would mean missing valid solutions...
I assume you want to print the numbers, so for a speedup you can compose the number as a binary string from the start, to avoid slow conversions between strings and numbers...
The nested for loops can be encoded as an increment operation on an array (similar to bignum arithmetic).
When I put it all together, I got this C++ code:
int generate()
    {
    const int n1=6;     // number of set bits
    const int n=32;     // number of bits
    char x[n+2];        // output number string
    int i[n1],j,cnt;    // nested for loop iterator variables and found solutions count
    for (j=0;j<n;j++) x[j]='0'; x[j]='b'; j++; x[j]=0;  // x = 0
    for (j=0;j<n1;j++){ i[j]=j; x[i[j]]='1'; }          // first solution
    for (cnt=0;;)
        {
        // Form1->mm_log->Lines->Add(x); // here x is the valid answer to print
        cnt++;
        for (j=n1-1;j>=0;j--)   // this emulates n1 nested for loops
            {
            x[i[j]]='0'; i[j]++;
            if (i[j]<n-n1+j+1){ x[i[j]]='1'; break; }
            }
        if (j<0) break;
        for (j++;j<n1;j++){ i[j]=i[j-1]+1; x[i[j]]='1'; }
        }
    return cnt; // found valid answers
    }
When I use this with n1=6, n=32, I get this output (without printing the numbers):
cnt = 906192
and it finished in 4.246 ms on an AMD A8-5500 3.2GHz (Win7 x64, 32-bit app, no threads), which is fast enough for me...
Beware that once you start outputting the numbers somewhere, the speed will drop drastically, especially if you output to a console or whatever... it might be better to buffer the output somehow, like outputting 1024 string numbers at once, etc. But as I mentioned before, I am no Python coder, so it might already be handled by the environment...
On top of all this, once you play with the variables n1, n, you can do the same for zeros instead of ones and use the faster approach (if there are fewer zeros than ones, use the nested for loops to mark zeros instead of ones).
If the solutions are wanted as numbers (not strings), it is possible to rewrite this so that i[] or i0,...,i5 hold bitmasks instead of bit positions... instead of inc/dec you just shift left/right... and there is no need for the x array anymore, as the number would be x = i0|...|i5.
You could create a counter array for the positions of the 1s in the number and assemble it by shifting the bits into their respective positions. I created an example below. It runs pretty fast (less than a second for 32 bits on my laptop):
bitCount = 32
oneCount = 6
maxBit = 1 << (bitCount - 1)
ones = [1 << b for b in reversed(range(oneCount))]  # start with bits on low end
ones[0] >>= 1  # shift back 1st one because it will be incremented at start of loop
index = 0
result = []
while index < len(ones):
    ones[index] <<= 1  # shift one at current position
    if index == 0:
        number = sum(ones)  # build output number
        result.append(number)
    if ones[index] == maxBit:
        index += 1  # go to next position when bit reaches max
    elif index > 0:
        index -= 1  # return to previous position
        ones[index] = ones[index + 1]  # and prepare it to move up (relative to next)
64 bits takes about a minute; the time is roughly proportional to the number of values that are output: O(n).
The same approach can be expressed more concisely in a recursive generator function which will allow more efficient use of the bit patterns:
def genOneBits(bitcount=32, onecount=6):
    for bitPos in range(onecount - 1, bitcount):
        value = 1 << bitPos
        if onecount == 1:
            yield value
            continue
        for otherBits in genOneBits(bitPos, onecount - 1):
            yield value + otherBits

result = [n for n in genOneBits(32, 6)]
This is not faster when you get all the numbers but it allows partial access to the list without going through all values.
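For example, you can take just the first few patterns without generating all 906192 of them:

from itertools import islice

# peek at the first five patterns produced by the generator
print([bin(n) for n in islice(genOneBits(32, 6), 5)])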
If you need direct access to the Nth bit pattern (e.g. to get a random one-bits pattern), you can use the following function. It works like indexing a list but without having to generate the list of patterns.
def numOneBits(bitcount=32, onecount=6):
    def factorial(X): return 1 if X < 2 else X * factorial(X - 1)
    return factorial(bitcount) // factorial(onecount) // factorial(bitcount - onecount)

def nthOneBits(N, bitcount=32, onecount=6):
    if onecount == 1:
        return 1 << N
    bitPos = 0
    while bitPos <= bitcount - onecount:
        group = numOneBits(bitcount - bitPos - 1, onecount - 1)
        if N < group:
            break
        N -= group
        bitPos += 1
    if bitPos > bitcount - onecount:
        return None
    result = 1 << bitPos
    result |= nthOneBits(N, bitcount - bitPos - 1, onecount - 1) << (bitPos + 1)
    return result
# bit pattern at position 1000:
nthOneBits(1000)  # --> 10485799 (00000000101000000000000000100111)
This allows you to get the bit patterns on very large integers that would be impossible to generate completely:
nthOneBits(10000, bitcount=256, onecount=9)
# 77371252457588066994880639
# 100000000000000000000000000000000001000000000000000000000000000000000000000000001111111
It is worth noting that the pattern order does not follow the numerical order of the corresponding numbers.
Although nthOneBits() can produce any pattern instantly, it is much slower than the other functions when mass-producing patterns. If you need to manipulate them sequentially, you should go for the generator function instead of looping on nthOneBits().
Also, it should be fairly easy to tweak the generator to have it start at a specific pattern so you could get the best of both approaches.
Finally, it may be useful to obtain the next bit pattern given a known pattern. This is what the following function does:
def nextOneBits(N=0, bitcount=32, onecount=6):
    if N == 0:
        return (1 << onecount) - 1
    bitPositions = []
    for pos in range(bitcount):
        bit = N % 2
        N //= 2
        if bit == 1:
            bitPositions.insert(0, pos)
    index = 0
    result = None
    while index < onecount:
        bitPositions[index] += 1
        if bitPositions[index] == bitcount:
            index += 1
            continue
        if index == 0:
            result = sum(1 << bp for bp in bitPositions)
            break
        if index > 0:
            index -= 1
            bitPositions[index] = bitPositions[index + 1]
    return result
nthOneBits(12) #--> 131103 00000000000000100000000000011111
nextOneBits(131103) #--> 262175 00000000000001000000000000011111 5.7ns
nthOneBits(13) #--> 262175 00000000000001000000000000011111 49.2ns
Like nthOneBits(), this one does not need any setup time. It could be used in combination with nthOneBits() to get subsequent patterns after getting an initial one at a given position. nextOneBits() is much faster than nthOneBits(i+1) but is still slower than the generator function.
For very large integers, using nthOneBits() and nextOneBits() may be the only practical options.
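As an aside (my addition, not part of the answer above): when the pattern fits in a machine integer, the classic bit-twiddling identity known as Gosper's hack computes the next number with the same popcount in O(1), doing the job of nextOneBits() for word-sized values:

def next_same_popcount(v):
    # Gosper's hack: smallest integer greater than v with the
    # same number of set bits (standard bit-twiddling identity)
    c = v & -v              # lowest set bit of v
    r = v + c               # ripple the carry upward
    return (((r ^ v) >> 2) // c) | r

x = 0b111111  # smallest value with six ones
print(bin(next_same_popcount(x)))  # 0b1011111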
You are dealing with permutations of multisets. There are many ways to achieve this and, as @BPL points out, doing this efficiently is non-trivial. There are many great methods mentioned here: permutations with unique values. The cleanest (though not necessarily the most efficient) is to use multiset_permutations from the sympy module.
import time
from sympy.utilities.iterables import multiset_permutations

t = time.process_time()
# Credit to @BPL for the general setup
multiPerms = ["".join(v) for v in multiset_permutations(["1"]*6+["0"]*26)]
elapsed_time = time.process_time() - t
print(elapsed_time)
On my machine, the above computes in just over 8 seconds. It also generates just under a million results:
>>> len(multiPerms)
906192
For one of my programming questions, I am required to define a function that accepts two variables, a list of length l and an integer w. I then have to find the maximum sum of a sublist with length w within the list.
Conditions:
1 <= w <= l <= 100000
Each element in the list ranges from [1, 100]
Currently, my solution works in O(n^2) (correct me if I'm wrong; code attached below), which the autograder does not accept, since we are required to find an even simpler solution.
My code:
def find_best_location(w, lst):
    best = 0
    n = 0
    while n <= len(lst) - w:
        lists = lst[n: n + w]
        cur = sum(lists)
        best = cur if cur > best else best
        n += 1
    return best
If anyone is able to find a more efficient solution, please do let me know! Also if I computed my big-O notation wrongly do let me know as well!
Thanks in advance!
1) Find the sum current of the first w elements and assign it to best.
2) Starting from i = w: current = current + lst[i] - lst[i-w], then best = max(best, current).
3) Done.
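In code, the steps above look like this (a minimal sketch, reusing the poster's function name):

def find_best_location(w, lst):
    # O(n) sliding window: maintain the running window sum instead of
    # recomputing it from scratch at every position
    current = sum(lst[:w])  # step 1: sum of the first window
    best = current
    for i in range(w, len(lst)):
        current += lst[i] - lst[i - w]  # step 2: add entering, drop leaving
        best = max(best, current)
    return best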
Your solution is indeed O(n^2) (or O(n*W) if you want a tighter bound)
You can do it in O(n) by creating an auxiliary array sums, where:
sums[0] = lst[0]
sums[i] = sums[i-1] + lst[i]
Then, by iterating over it and checking sums[i] - sums[i-w], you can find your solution in linear time.
You can even calculate sums array on the fly to reduce space complexity, but if I were you, I'd start with it, and see if I can upgrade my solution next.
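A sketch of this prefix-sum variant (my own illustration, reusing the poster's function name):

def find_best_location(w, lst):
    # sums[i] holds lst[0] + ... + lst[i]
    sums = [0] * len(lst)
    sums[0] = lst[0]
    for i in range(1, len(lst)):
        sums[i] = sums[i - 1] + lst[i]
    best = sums[w - 1]  # sum of the first window
    for i in range(w, len(lst)):
        best = max(best, sums[i] - sums[i - w])
    return best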
(10 points) Write an O(nlogn) algorithm to find the majority element of a list of items. (Assume that the number of elements is a power of 2). Again, the only operation you can use on items of the list is equality comparison. Hint: solve a problem of size n by solving two sub-problems of size n/2
This was a test question on divide and conquer, for my algorithms class.
Here is the code I wrote in python 3.5.
def majElement(L):
    tally = 0
    if len(L) == 1:
        return 1
    for i in range(len(L)):
        tally = majElement(L[i:len(L)]) + majElement(L[len(L)/2:])
        if tally > (len(L)/2):
            print(L[i])
This code results in a stack overflow. Somehow I'm not reaching my base case.
How can I stop the infinite recursive calls?
Not sure about the divide and conquer approach, but the majority element of an array/list can be identified in O(nlogn) time.
This can be done using a binary search tree whose nodes have the structure:
struct tree
{
    int element;
    int count;
} BST;
Algorithm:
Insert the elements into the BST one by one; if an element is already present, increment the count of its node.
At any stage, if the count of a node becomes more than n/2, return it.
Now, the worst-case complexity can be O(n^2) in the case of a skewed BST, so use a self-balancing BST to ensure O(nlogn) time.
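A minimal sketch of this idea in Python (my own code, not from the answer; note that a BST requires an ordering on the items, which the assignment's equality-only restriction technically rules out):

class Node:
    def __init__(self, element):
        self.element = element
        self.count = 1
        self.left = None
        self.right = None

def majority_bst(items):
    # insert items one by one, bumping the count of repeated elements,
    # and stop as soon as some count exceeds n/2
    n = len(items)
    root = None
    for x in items:
        node, parent = root, None
        while node is not None and node.element != x:
            parent = node
            node = node.left if x < node.element else node.right
        if node is None:
            node = Node(x)
            if parent is None:
                root = node
            elif x < parent.element:
                parent.left = node
            else:
                parent.right = node
        else:
            node.count += 1
        if node.count > n // 2:
            return node.element  # majority found early
    return None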
If you want to do it using the divide and conquer approach:
Algorithm:
Divide the array into two halves L and R.
int m1 = Majority(L); int m2 = Majority(R);
If m1 is a majority element, return it.
If m2 is a majority element, return it.
Otherwise return "no majority element".
Code:
// Initial call: majority(A, 0, n - 1). Returns the majority element
// of A[i..j], or -1 if there is none.
int majority(int *A, int i, int j)
{
    if (i == j)                         // base case: a single element
        return A[i];
    int mid = (i + j) / 2;
    int m1 = majority(A, i, mid);       // candidate from the left half
    int m2 = majority(A, mid + 1, j);   // candidate from the right half
    int count = 0;
    for (int k = i; k <= j; k++)
        if (A[k] == m1)
            count++;
    if (count > (j - i + 1) / 2)
        return m1;
    count = 0;
    for (int k = i; k <= j; k++)
        if (A[k] == m2)
            count++;
    if (count > (j - i + 1) / 2)
        return m2;
    return -1;                          // no majority element
}
Your first problem is that you are only returning a value in the base case; the rest of the time, you print a value and return None.
Second, the first recursive call is made when i == 0, which means L[i:len(L)] is the same as L, so you aren't actually reducing the size of the problem. At the very least, you want for i in range(1, len(L)), but I suspect you aren't properly decomposing the problem into subproblems. (You shouldn't have to make as many recursive calls as you are proposing.)
Not sure how your algorithm is meant to work, but mine implements finding the majority element with the divide and conquer technique. It divides the array into two slices and finds a potential candidate in each; it then checks whether either candidate is actually the majority element.
def divideAndConquer(array):  # solve by divide and conquer
    if len(array) == 1:
        return array[0]
    middle = len(array) // 2
    left = array[:middle]    # first half
    right = array[middle:]   # second half
    cLeft = divideAndConquer(left)    # candidate for left side
    cRight = divideAndConquer(right)  # candidate for right side
    if cLeft == cRight:  # if both halves agree, that is the candidate
        return cLeft
    if array.count(cLeft) > middle:   # otherwise check the left candidate
        return cLeft
    if array.count(cRight) > middle:  # and then the right candidate
        return cRight
    return "No majority element"      # no majority element
What is the big O of the following if statement?
if "pl" in "apple":
...
What is the overall big O of how Python determines whether the string "pl" is found in the string "apple",
or of any other substring-in-string search?
Is this the most efficient way to test whether a substring is in a string? Does it use the same algorithm as .find()?
The time complexity is O(N) on average, O(NM) worst case (N being the length of the longer string, M the length of the shorter string you search for). As of Python 3.10, heuristics are used to lower the worst-case scenario to O(N + M) by switching algorithms.
The same algorithm is used for str.index(), str.find(), str.__contains__() (the in operator) and str.replace(); it is a simplification of the Boyer-Moore with ideas taken from the Boyer–Moore–Horspool and Sunday algorithms.
See the original stringlib discussion post, as well as the fastsearch.h source code; until Python 3.10, the base algorithm had not changed since its introduction in Python 2.5 (apart from some low-level optimisations and corner-case fixes).
The post includes a Python-code outline of the algorithm:
def find(s, p):
    # find first occurrence of p in s
    n = len(s)
    m = len(p)
    skip = delta1(p)[p[m-1]]
    i = 0
    while i <= n-m:
        if s[i+m-1] == p[m-1]:  # (boyer-moore)
            # potential match
            if s[i:i+m-1] == p[:m-1]:
                return i
            if s[i+m] not in p:
                i = i + m + 1  # (sunday)
            else:
                i = i + skip   # (horspool)
        else:
            # skip
            if s[i+m] not in p:
                i = i + m + 1  # (sunday)
            else:
                i = i + 1
    return -1  # not found
as well as speed comparisons.
In Python 3.10, the algorithm was updated to use an enhanced version of Crochemore and Perrin's Two-Way string searching algorithm for larger problems (with p and s longer than 100 and 2100 characters, respectively, and with s at least 6 times as long as p), in response to a pathological edge case someone reported. The commit adding this change included a write-up on how the algorithm works.
The Two-Way algorithm has a worst-case time complexity of O(N + M), where O(M) is a cost paid up-front to build a shift table from the search needle p. Once you have that table, the algorithm has a best-case performance of O(N/M).
In Python 3.4.2, it looks like both resort to the same function, but there may nevertheless be a difference in timing. For example, s.find is first required to look up the find method of the string, and such.
The algorithm used is a mix between Boyer-Moore and Horspool.
You can use timeit and test it yourself:
maroun@DQHCPY1:~$ python -m timeit 's = "apple";s.find("pl")'
10000000 loops, best of 3: 0.125 usec per loop
maroun@DQHCPY1:~$ python -m timeit 's = "apple";"pl" in s'
10000000 loops, best of 3: 0.0371 usec per loop
Using in is indeed faster (0.0371 usec compared to 0.125 usec).
For actual implementation, you can look at the code itself.
I think the best way to find out is to look at the source. This looks like it would implement __contains__:
static int
bytes_contains(PyObject *self, PyObject *arg)
{
    Py_ssize_t ival = PyNumber_AsSsize_t(arg, PyExc_ValueError);
    if (ival == -1 && PyErr_Occurred()) {
        Py_buffer varg;
        Py_ssize_t pos;
        PyErr_Clear();
        if (PyObject_GetBuffer(arg, &varg, PyBUF_SIMPLE) != 0)
            return -1;
        pos = stringlib_find(PyBytes_AS_STRING(self), Py_SIZE(self),
                             varg.buf, varg.len, 0);
        PyBuffer_Release(&varg);
        return pos >= 0;
    }
    if (ival < 0 || ival >= 256) {
        PyErr_SetString(PyExc_ValueError, "byte must be in range(0, 256)");
        return -1;
    }
    return memchr(PyBytes_AS_STRING(self), (int) ival, Py_SIZE(self)) != NULL;
}
in terms of stringlib_find(), which uses fastsearch().