Find one occurrence of substring using suffix array

I'm trying to figure out how to binary search a suffix array for one occurrence of a pattern.
Let's take a text: petertomasjohnerrnoerror.
I want to find er.
SA is the suffix array of this text: 8,14,19,3,1,12,10,7,13,17,18,11,6,22,0,23,16,21,15,20,4,9,2,5
Now I want to find any index into the suffix array whose value points at an 'er'. The suffixes starting with 'er' begin at positions 3, 14 and 19 of the text, so the function should return SA index 1, 2 or 3.
I'm trying to use a binary search, but I can't figure out how.
def findOneOccurence(text,SA,p):
    high = len(text)-1 # The last index
    low = 0 # the lowest index
    while True:
        check = (high-low)/2 # find a middle
        if p in text[SA[check]:SA[check]+len(p)]:
            return check
        else:
            if text[SA[check]:SA[check]+len(p)]<p:
                low = check
            else:
                high = check
        if high<=low:
            return None
This returns 11, but text[SA[11]:SA[11]+2] is 'oh' instead of 'er'.
Where could the problem be?
This function needs to work on huge texts, millions of characters long.
EDIT: I've found a mistake. Instead of if text[SA[check]:SA[check+len(p)]]<p: it should be text[SA[check]:SA[check]+len(p)]<p: but it's still wrong. It returns None instead of an index of 'er'.
EDIT II: Another mistake: I changed if high>=low to high<=low. Now it returns 2, which is good.
EDIT III: Now it works, but on some inputs it gets stuck in the loop and never terminates.

Borrowing and editing https://hg.python.org/cpython/file/2.7/Lib/bisect.py
>>> text= 'petertomasjohnerrnoerror'
>>> SA = 8,14,19,3,1,12,10,7,13,17,18,11,6,22,0,23,16,21,15,20,4,9,2,5
>>> def bisect_left(a, x, text, lo=0, hi=None):
...     if lo < 0:
...         raise ValueError('lo must be non-negative')
...     if hi is None:
...         hi = len(a)
...     while lo < hi:
...         mid = (lo+hi)//2
...         if text[a[mid]:] < x: lo = mid+1
...         else: hi = mid
...     if not text[a[lo]:].startswith(x):
...         # i suppose text[a[lo]:a[lo]+len(x)] == x could be a faster check
...         raise IndexError('not found')
...     return a[lo]
...
>>> bisect_left(SA, 'er', text)
14
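Note that the function above returns the text position (14), whereas the question asks for an index into SA. Below is a minimal sketch of a variant that returns the SA index and compares only the first len(pattern) characters of each suffix, which avoids copying long suffixes on huge texts. The name find_one_occurrence and the choice to return None when nothing matches are assumptions for illustration, not part of the original answer.

```python
def find_one_occurrence(text, sa, pattern):
    """Return an index i into sa such that text[sa[i]:] starts with pattern, else None."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        # Compare only the first m characters of the suffix starting at sa[mid].
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    # lo is now the first suffix whose m-character prefix is >= pattern (if any).
    if lo < len(sa) and text[sa[lo]:sa[lo] + m] == pattern:
        return lo
    return None


text = 'petertomasjohnerrnoerror'
SA = (8, 14, 19, 3, 1, 12, 10, 7, 13, 17, 18, 11, 6, 22, 0, 23, 16, 21, 15, 20, 4, 9, 2, 5)
print(find_one_occurrence(text, SA, 'er'))  # 1, and SA[1] == 14
```

Because lo and hi always move past or onto mid, the interval shrinks on every iteration, so this loop cannot run forever the way the version in the question can.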

Related

Maximal set of string-covering substring terms

I want to calculate the largest covering of a string from many sets of substrings.
All strings in this problem are lowercased, and contain no whitespace or unicode strangeness.
So, given a string: abcdef, and two groups of strings: ['abc', 'bc'], ['abc', 'd'], the second group (['abc', 'd']) covers more of the original string. Order matters for exact matches, so the term group ['fe', 'cba'] would not match the original string.
I have a large collection of strings and a large collection of term groups, so I would like a reasonably fast implementation if possible.
I've tried the following in Python for an example. I've used Pandas and Numpy because I thought it may speed it up a bit. I'm also running into an over-counting problem as you'll see below.
import re
import pandas as pd
import numpy as np
my_strings = pd.Series(['foobar', 'foofoobar0', 'apple'])
term_sets = pd.Series([['foo', 'ba'], ['foo', 'of'], ['app', 'ppl'], ['apple'], ['zzz', 'zzapp']])
# For each string, calculate best proportion of coverage:
# Try 1: Create a function for each string.
def calc_coverage(mystr, term_sets):
    # Total length of string
    total_chars = len(mystr)
    # For each term set, sum up length of any match. Problem: this over counts when matches overlap.
    total_coverage = term_sets.apply(lambda x: np.sum([len(term) if re.search(term, mystr) else 0 for term in x]))
    # Fraction of String covered. Note the above over-counting can result in fractions > 1.0.
    coverage_proportion = total_coverage/total_chars
    return coverage_proportion.argmax(), coverage_proportion.max()
my_strings.apply(lambda x: calc_coverage(x, term_sets))
This results in:
0 (0, 0.8333333333333334)
1 (0, 0.5)
2 (2, 1.2)
This presents some problems. The biggest one I see is that overlapping terms are counted separately, which results in the 1.2, or 120%, coverage.
I think the ideal output would be:
0 (0, 0.8333333333333334)
1 (0, 0.8)
2 (3, 1.0)
I think I can write a double for loop and brute-force it, but this problem feels like it has a more optimal solution, or a small change to what I've done so far might get it to work.
Note: if there is a tie, returning the first match is fine. I'm not too interested in returning all best matches.

Ok, this is not optimized but let's start fixing the results. I believe you have two issues: one is the over-counting in apple; the other is the under-counting in foofoobar0.
Solving the second issue is easy when the term set consists of two non-overlapping terms (or just one term). With s the string and ts the term set,
sum([s.count(t)*len(t) for t in ts])
will do the job.
Similarly, when we have two overlapping terms, we will just take the "best" one:
max([s.count(t)*len(t) for t in ts])
So we are left with the problem of recognizing when the two terms overlap. I don't even consider term sets with more than two terms, because the solution will already be painfully slow with two :(
Let's define a function to test for overlapping:
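def terms_overlap(s, ts):
    # True if an occurrence of one term starts inside an occurrence of the other.
    if ts[0] not in s or ts[1] not in s:
        return False
    start = 0
    while (pos_0 := s.find(ts[0], start)) > -1:
        # First occurrence of ts[1] at or after this occurrence of ts[0].
        if (pos_1 := s.find(ts[1], pos_0)) > -1:
            if pos_0 <= pos_1 < pos_0 + len(ts[0]):
                return True
        start = pos_0 + 1  # continue with the next occurrence of ts[0]
    start = 0
    while (pos_1 := s.find(ts[1], start)) > -1:
        # Same check with the roles of the two terms swapped.
        if (pos_0 := s.find(ts[0], pos_1)) > -1:
            if pos_1 <= pos_0 < pos_1 + len(ts[1]):
                return True
        start = pos_1 + 1  # continue with the next occurrence of ts[1]
    return False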
With that function ready we can finally do:
def calc_coverage(strings, tsets):
    for xs, s in enumerate(strings):
        best_cover = 0
        best_ts = 0
        for xts, ts in enumerate(tsets):
            if len(ts) == 1:
                cover = s.count(ts[0])*len(ts[0])
            elif len(ts) == 2:
                if terms_overlap(s, ts):
                    cover = max([s.count(t)*len(t) for t in ts])
                else:
                    cover = sum([s.count(t)*len(t) for t in ts])
            else:
                raise ValueError('Cannot handle term sets of more than two terms')
            if cover > best_cover:
                best_cover = cover
                best_ts = xts
        print(f'{xs}: {s:15} {best_cover:2d} / {len(s):2d} = {best_cover/len(s):8.3f} ({best_ts}: {tsets[best_ts]})')
>>> calc_coverage(my_strings, term_sets)
0: foobar           5 /  6 =    0.833 (0: ['foo', 'ba'])
1: foofoobar0       8 / 10 =    0.800 (0: ['foo', 'ba'])
2: apple            5 /  5 =    1.000 (3: ['apple'])
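An alternative that is not part of the answer above: mark every covered character in a boolean mask and count the marked positions. This sidesteps the overlap test entirely and works for term sets of any size. The sketch below assumes non-empty input strings and uses plain Python lists; the function names covered_chars and best_term_set are made up for illustration.

```python
def covered_chars(s, terms):
    """Count positions of s covered by at least one occurrence of any term."""
    mask = [False] * len(s)
    for term in terms:
        if not term:
            continue  # an empty term covers nothing
        pos = s.find(term)
        while pos != -1:
            for i in range(pos, pos + len(term)):
                mask[i] = True
            # Step by one so that occurrences overlapping themselves are also seen.
            pos = s.find(term, pos + 1)
    return sum(mask)


def best_term_set(s, term_sets):
    """Return (index, coverage fraction) of the first term set with maximal coverage."""
    best_idx, best_cov = 0, -1
    for idx, terms in enumerate(term_sets):
        cov = covered_chars(s, terms)
        if cov > best_cov:  # strict comparison keeps the first set on ties
            best_idx, best_cov = idx, cov
    return best_idx, best_cov / len(s)


term_sets = [['foo', 'ba'], ['foo', 'of'], ['app', 'ppl'], ['apple'], ['zzz', 'zzapp']]
for s in ['foobar', 'foofoobar0', 'apple']:
    print(s, best_term_set(s, term_sets))
```

With this counting, overlapping terms such as 'app' and 'ppl' cover four characters of 'apple' together instead of being reduced to the better single term, so individual coverage values can differ from the max-based approach above even though the best term set is the same for these three strings.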

My function isn't printing anything other than "None"

I'm doing an assignment for school that's due tonight and I can't figure out why it isn't working. I'm really new to programming in general but I just don't understand why it doesn't work.
The program is supposed to convert from decimal to binary and, depending on how big the number is, print it in either 8 bits or 16 bits.
# Names: värde = value, antal_bitar = number of bits, bitvärde = bit value, invärde = input value
def dec2bin(värde, antal_bitar):
    while bitvärde == (2 ** (antal_bitar - 1)):
        if värde >= bitvärde:
            return str("1")
            värde= värde - bitvärde
        else:
            return str("0")
        antal_bitar = antal_bitar - 1
invärde_ok = False
invärde = 0
while invärde_ok == False:
    # "Enter a decimal value: "
    invärde=(int(input("Ange ett decimalt värde: ")))
    if (invärde > 65536):
        # "Error. Cannot handle large numbers. Try again."
        print("Fel. Kan inte hantera stora tal. Försök igen.")
    else:
        if invärde < 0:
            # "Error. Can only handle positive numbers. Try again."
            print("Fel. Kan bara hantera positiva tal. Försök igen.")
        else:
            invärde_ok = True
if invärde < 256:
    bitvärde=8
    # "The number ... fits in one byte and in binary becomes:"
    print("Talet", invärde , "ryms i en byte och blir binärt:")
    print(dec2bin(invärde,bitvärde))
else:
    bitvärde=16
    # "The number ... fits in 16 bits and in binary becomes:"
    print("Talet", invärde , "ryms i 16 bitar och blir binärt:")
    print(dec2bin(invärde,bitvärde))
Sorry for the Swedish parts.

The problem is that, instead of giving bitvärde a new value in each iteration of your dec2bin function, you're checking whether it equals a certain value, which it does not, so the loop body never runs and the function returns None. Instead, you should use a for loop,
for i in range(antal_bitar-1,-1,-1):
which will give i a different value each iteration.
range(antal_bitar-1,-1,-1) simply means that i will get values starting from antal_bitar-1, changing by -1 every turn, and ending before -1, i.e. at 0.
In the loop, just add the following:
bitvärde = 2**i
Remove the antal_bitar = antal_bitar - 1 from the end.
Also, when you use return in the function, that ends the function's execution. You want it to add a 1 or a 0 to the end of the final string instead.
For that, define an empty string, result = "", at the beginning (before the for loop).
Instead of return str("1"), use result += "1", which simply means result = result + "1" (and likewise for "0").
At the end of the function, after the loop, put:
return result
That should do it! Of course, you can rename result to something else in Swedish.
Here's what the final code should look like:
def dec2bin(värde, antal_bitar):
    result = ""
    for i in range(antal_bitar-1,-1,-1):
        bitvärde = 2**(i)
        if värde>=bitvärde:
            result += "1"
            värde=värde-bitvärde
        else:
            result += "0"
    return result
Hopefully this matches the pseudocode you were given.
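One way to check the loop is to compare it against Python's built-in format(), which produces zero-padded binary strings directly. The sketch below repeats the function with English names (value, bit_count and bit_value are stand-ins chosen for illustration) so that it can be run on its own.

```python
def dec2bin(value, bit_count):
    """Same loop as the final code above, with English (hypothetical) names."""
    result = ""
    for i in range(bit_count - 1, -1, -1):
        bit_value = 2 ** i
        if value >= bit_value:
            result += "1"
            value -= bit_value
        else:
            result += "0"
    return result


# format(n, "08b") pads the binary representation of n with zeros to 8 digits.
for value, bits in [(0, 8), (5, 8), (255, 8), (256, 16), (65535, 16)]:
    assert dec2bin(value, bits) == format(value, f"0{bits}b")
print(dec2bin(5, 8))  # 00000101
```

Note that the input check in the question still accepts 65536, which needs 17 bits. For that input the loop returns sixteen 1s, so the check would probably need to be invärde > 65535.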

Create "synthetic points"

I need to create, inside a Python routine, something that I am calling "synthetic points".
I have a series of data that varies between -1 and 1; however, when I plot this data, it forms a trapezoidal chart.
What I would like to do is create points where the same x-axis value can take two y-axis values, so that the chart is drawn with straight vertical lines, making a "rectangular chart".
An example of the data format that I have:
0;-1
1;-1
2;-1
3;-1
4;-1
5;-1
6;-1
7;1
8;1
9;1
10;1
11;1
12;1
13;1
14;1
15;1
16;-1
17;-1
18;-1
19;-1
20;-1
For example, in this case, I would need the data to take the following format:
0;-1
1;-1
2;-1
3;-1
4;-1
5;-1
6;-1
6;1 (point 6 with two values)
7;1
8;1
9;1
10;1
11;1
12;1
13;1
14;1
15;1
15;-1 (point 15 with two values)
16;-1
17;-1
18;-1
19;-1
20;-1
So what I need is this: whenever the value changes, a new point is created. This makes the graph rectangular, as the only possible values for the y variable are -1 and 1.
The code I have so far is below. All it does is convert the input data into this -1 and 1 format.
arq = open('vazdif.out', 'rt')
list = []
i = 0
for row in arq:
    field = row.split(';')
    vaz = float(field[2])
    if vaz < 0:
        list.append("-1")
    elif vaz > 0:
        list.append("1")
n = len(list)
fou = open('res_id.out','wt')
for i in range(n):
    fou.write('{};{}\n'.format(i,list[i]))
fou.close()
Thank you for your help.
P.S. English is not my first language, so please forgive any mistakes in my writing or in the code.

I added a new variable, prev_value. If the previous value has the opposite sign (its product with the current value is < 0), an extra point is added to the list.
I think field[1] and field[2] are probably wrong, but I'll trust that your code works so far. Similarly, for fou I would use a with open(...) block instead.
arq = open('vazdif.out', 'rt')
list = []
i = 0
prev_value = 0
for row in arq:
    field = row.split(';')
    xxx = int(field[1])
    vaz = float(field[2])
    # Sign change: repeat the previous x with the opposite y before adding the new point.
    if vaz * prev_value < 0:
        list.append([list[-1][0], - list[-1][1]])
    if vaz < 0:
        list.append([xxx, -1])
    else:
        list.append([xxx, 1])
    prev_value = vaz  # remember the current value for the next row
fou = open('res_id.out','wt')
for i in list:
    fou.write(f'{i[0]};{i[1]}\n')
fou.close()
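The same idea can be tried without any files. The following is a minimal sketch that works on an in-memory list of (x, y) pairs already reduced to -1 and 1; the function name add_step_points and the use of tuples are assumptions made for illustration.

```python
def add_step_points(points):
    """Insert an extra (previous x, new y) point wherever y changes between neighbours."""
    result = []
    for x, y in points:
        if result and result[-1][1] != y:
            # Repeat the previous x with the new y so the plot gets a vertical edge.
            result.append((result[-1][0], y))
        result.append((x, y))
    return result


# The sample data from the question: -1 for x = 0..6, 1 for x = 7..15, -1 for x = 16..20.
data = [(x, -1) for x in range(7)] + [(x, 1) for x in range(7, 16)] + [(x, -1) for x in range(16, 21)]
for x, y in add_step_points(data):
    print(f'{x};{y}')
```

For the sample data this adds the two extra points 6;1 and 15;-1 shown in the question's desired output.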

python3 how to find the largest prefix of a string of bytes that is a substring of another

I need to find the largest prefix (string of bytes starting from the beginning) of a bytes object s1 that is a substring of another bytes object s2, and return its start location in s2 and its length. In this case s2 happens to overlap s1 as well.
The optimal result is the longest prefix that starts closest to the end of s2.
I have tried to implement this using bytes.rfind method as below.
Note: This is trying to find the largest prefix starting at index index in the bytes object src that is present earlier in src within a maximum of maxOffset bytes before index. Therefore, s1 is src[index:] and s2 is src[index-maxOffset:index+maxLength-1]. maxLength is the maximum length of prefix that I am interested in searching for.
def crl(index, src, maxOffset, maxLength):
    """
    Returns starting position in source before index from where the max runlength is detected.
    """
    src_size = len(src)
    if index > src_size:
        return (-1, 0)
    if (index+maxLength) > src_size:
        maxLength = src_size - index
    startPos = max(0, index-maxOffset)
    endPos = index+maxLength-1
    l = maxLength
    while l>1:
        if src[index:index+l] in src[startPos:index+l-1]:
            p = src.rfind(src[index:index+l], startPos, index+l-1)
            return (p,l)
        l -= 1
    return (-1, 0)
I have also tried to hand-code this as below, since the previous implementation was very slow:
def ocrl(index, src, maxOffset, maxLength):
    """
    Returns starting position in source before index from where the max runlength is detected.
    """
    size = len(src)
    if index>=size:
        return (-1, 0)
    startPos = index - 1 # max(0, index-maxOffset)
    stopPos = max(0, index-maxOffset)
    runs = {}
    while(startPos >= stopPos):
        currRun = 0
        pos = startPos
        while src[pos] == src[index+currRun]:
            currRun += 1
            pos += 1
            if currRun == maxLength:
                return (startPos, maxLength) #found best possible run
            if (pos >= size) or ((index+currRun) >= size):
                break
        if (currRun > 0) and (currRun not in runs.keys()):
            runs[currRun] = startPos
        startPos -= 1
    if not runs:
        return (-1, 0)
    else:
        # Return the index from where the longest run was found
        return (runs[max(runs.keys())], max(runs.keys()))
While the second implementation is faster, it is still very slow and, I believe, inefficient. How can I make this more efficient and run faster?

In my opinion, you can use a modified Knuth-Morris-Pratt string searching algorithm that matches substrings for as long as it can and remembers the longest match found.
I am not sure that there is a benefit to working backwards rather than forwards, since once you have found a match you still need to continue the search for a longer one (except when you have matched the whole string).
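A minimal sketch of that idea, under stated assumptions: the pattern is src[index:index+max_length], the scan runs forward over the same window as crl in the question, and only matches that start before index are accepted. The function names are made up for illustration. Unlike crl, this sketch also reports matches of length 1, and among equally long matches it keeps the one starting closest to index.

```python
def prefix_function(p):
    """KMP failure table: pi[i] is the length of the longest proper border of p[:i+1]."""
    pi = [0] * len(p)
    k = 0
    for i in range(1, len(p)):
        while k and p[i] != p[k]:
            k = pi[k - 1]
        if p[i] == p[k]:
            k += 1
        pi[i] = k
    return pi


def longest_prefix_before(src, index, max_offset, max_length):
    """Return (start, length) of the longest prefix of src[index:] found earlier, or (-1, 0)."""
    pattern = src[index:index + max_length]
    if not pattern:
        return (-1, 0)
    pi = prefix_function(pattern)
    start_pos = max(0, index - max_offset)
    stop = min(len(src), index + len(pattern) - 1)  # same right edge as crl's window
    best_start, best_len = -1, 0
    k = 0  # length of the pattern prefix currently matched
    for i in range(start_pos, stop):
        while k and src[i] != pattern[k]:
            k = pi[k - 1]
        if src[i] == pattern[k]:
            k += 1
        start = i - k + 1
        # ">=" prefers the latest start among matches of equal length.
        if k and start < index and k >= best_len:
            best_start, best_len = start, k
        if k == len(pattern):
            k = pi[k - 1]
    return (best_start, best_len)


print(longest_prefix_before(b'abcabcabcd', 3, 3, 7))  # (0, 6)
```

The scan touches each byte of the window once, so the cost is linear in the window size plus the pattern length, instead of one substring search per candidate length.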

Build a suffix array for the second string and search for the first string in that array, choosing the last index among the suffixes with the longest common prefix.

Binary Search counter

I create a list of length 100:
li2 = list(range(100))
I use the binary search function below, with a counter; however, it takes 5 searches to find 50. It should find it on the first try, since 100/2 = 50 and li2[50] == 50.
def binary_search(li,item):
    low = 0
    high = len(li)-1
    trys = 0
    while low<=high:
        mid = int((low + high)/2)
        guess = li[mid]
        if guess == item:
            return'Found',item, 'in', trys,'searches'
        elif guess > item:
            trys+=1
            high = mid - 1
        else:
            trys+=1
            low = mid + 1
    return item,' not found', trys, ' searches attempted'
I run binary_search(li2,50) and it returns:
('Found', 50, 'in', 5, 'searches')

range(100) produces the elements 0 through 99, so high starts at 99 and the first midpoint is int((0 + 99)/2) = 49, not 50. Your binary search therefore starts at 49 and needs several more halvings before it lands on 50.
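To see this, the midpoints can be recorded as the search runs. The sketch below uses the same low/high updates as the function in the question; the helper name probes is made up for illustration.

```python
def probes(li, item):
    """Return the midpoints visited by a binary search with the question's update rules."""
    low, high = 0, len(li) - 1
    visited = []
    while low <= high:
        mid = (low + high) // 2
        visited.append(mid)
        if li[mid] == item:
            break
        if li[mid] > item:
            high = mid - 1
        else:
            low = mid + 1
    return visited


print(probes(list(range(100)), 50))  # [49, 74, 61, 55, 52, 50]
```

Six positions are inspected in total. The counter in the question is only incremented on a miss, which is why it reports 5. With list(range(101)) the first midpoint would be 50 and the item would be found immediately.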

Why reinvent the wheel when you can use bisect, which often has a native implementation and is therefore very fast?
bisect_left returns the insertion position of the item in the list (which must be sorted, of course). If the index is outside the list range, then the item is not in the list. If the index is within the list range and the item is located at this index, then you've found it:
import bisect
def binary_search(li,item):
    insertion_index = bisect.bisect_left(li,item)
    return insertion_index<len(li) and li[insertion_index]==item
testing:
li2 = list(range(100))
print(binary_search(li2,50))
print(binary_search(li2,-2))
print(binary_search(li2,104))
results:
True
False
False
