python notation from textbook - python

I have a code snippet from my algorithms textbook and I'm having a hard time understanding it.
K(0) = 0
for w = 1 to W
    K(w) = max{K(w - w_i) + vi : w_i <= w}
return K(W)
I'm confused as to what's happening on line 3. What does the colon mean here? Can this be written in a different way?

This doesn't look like python. It seems that K is supposed to be an array, but indices are indicated by square brackets in python, so K[0] = 0
And the for w = 1 to W doesn't work in python at all, it would be more like this: for w in range(1, W+1):
As for what the pseudo code does: It looks like for each element of K, it calculates the maximum of all previous values and adds vi.
for w in range(1, W+1):
    K[w] = max(K[w - w_i] + vi for w_i in range(1, w+1))
But vi doesn't seem to change, so for positive vi, this produces simply a linearly ascending array (i.e. [0, 2, 4, 6, ... for vi = 2), and for negative it just repeats vi over and over: [0, -3, -3, -3, ... for vi = -3
Since it returns only the last value of the array, it could be simplified to
return W*vi if vi>0 else vi
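Another way to read the pseudocode, offered here as an assumption since the question does not define w_i and vi: if they are the weight and value of item i, the colon reads "such that" and the max runs over every item whose weight fits into capacity w. Under that reading K(w) is the best total value for capacity w when items may be reused (the knapsack problem with repetition). A minimal sketch, with the function name and the (weight, value) pair layout chosen only for illustration:

```python
def knapsack_with_repetition(capacity, items):
    """items is a list of (weight, value) pairs; weights are positive integers."""
    K = [0] * (capacity + 1)          # K(0) = 0
    for w in range(1, capacity + 1):  # for w = 1 to W
        # max{K(w - w_i) + v_i : w_i <= w}; default=0 covers "no item fits"
        K[w] = max((K[w - w_i] + v_i for w_i, v_i in items if w_i <= w), default=0)
    return K[capacity]                # return K(W)


print(knapsack_with_repetition(10, [(6, 30), (3, 14), (4, 16), (2, 9)]))  # 48
```

Here the condition after `if` plays the role of the colon in the set-builder notation.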

Related

How to extract the value of ArithRef from within the solver constraint in Z3py?

I am trying to add a constraint from within the Z3py code that requires extracting of a value for an ArithRef variable but I'm unable to do so. I have attached the Z3 snippet below:
from z3 import *
import random
R = 3
H = 4
Dest = [[2,3,4], [0,3,2,5], [1,4,5], [1,4,2], [3,1], [2,3,1]]
s = Solver()
T =[[ Int('r%d_t%d' % (k,t))
      for k in range (R)] for t in range (H)]
for t in range(H):
    for r in range(R):
        s.add(If(t==0, If(r==0, T[t][r]==0, T[t][r]==-1),
                 If(T[t-1][r]==0, T[t][r]==random.choice(Dest[0]), T[t][r]==-1)))
Here I get the solution as follows:
[[ 0 2 -1 -1]
[-1 -1 -1 -1]
[-1 -1 -1 -1]]
However, I try to generalize the constraint to the following,
for t in range(H):
    for r in range(R):
        s.add(If(t==0, If(r==0, T[t][r]==0, T[t][r]==-1),
                 If(T[t-1][r]==0, T[t][r]==random.choice(Dest[T[t-1][r]]), T[t][r]==-1)))
I get the error:
list indices must be integers or slices, not ArithRef
How can this issue be resolved ?
In general, you cannot mix and match indexing into a Python list or calling random on a list when symbolic variables are involved. That is, it is illegal to call A[k] when k is a symbolic integer. That's what the error message is telling you: Until you actually run the solver (via a call to s.check()) and grab the value from a model, you cannot index a list via a symbolic variable. Similarly, calling random.choice isn't going to work either when you need to symbolically pick which list to pick from.
What you need to do, instead, is to create symbolic index values that correspond to your random values, and do a symbolic walk down the list. Something like this:
from z3 import *
R = 3
H = 4
Dest = [[2,3,4], [0,3,2,5], [1,4,5], [1,4,2], [3,1], [2,3,1]]
s = Solver()
T = [[Int('r%d_t%d' % (k,t)) for k in range (R)] for t in range (H)]
# Symbolically walk over a list, grabbing the ith element of it
# The idea here is that i is not just an integer, but can be a symbolic
# integer with other constraints on it. We also take a continuation, k,
# which we call on the result, for convenience. It's assumed that the
# caller makes sure lst[i] is always within bounds; i.e., i >= 0, and
# i < len(lst).
def SymbolicWalk(i, lst, k):
    if len(lst) == 1:
        return k(lst[0])
    else:
        return If(i == 0, k(lst[0]), SymbolicWalk(i-1, lst[1:], k))
# Pick a random element of the list
# This is the symbolic version of random.choice
def SymbolicChoice(lst):
    i = FreshInt('internal_choice')
    s.add(i >= 0)
    s.add(i < len(lst))
    return SymbolicWalk(i, lst, lambda x: x)
# Grab the element at given symbolic index, and then call the symbolic choice on it
def index(i, lst):
    return SymbolicWalk(i, lst, SymbolicChoice)
for t in range(H):
    for r in range(R):
        s.add(If(t==0, If(r==0, T[t][r]==0, T[t][r]==-1),
                 If(T[t-1][r]==0, T[t][r]==index(T[t-1][r], Dest), T[t][r]==-1)))
print(s.check())
print(s.model())
I've added some comments above which I hope should help. But the work-horse here is the function SymbolicWalk which achieves the indexing of a list with a symbolic value. Note the creation of "random"-indexes via the call to FreshInt.
When I run the above it produces:
[ r2_t3 = -1,
r1_t3 = -1,
r0_t3 = -1,
r2_t2 = -1,
r1_t2 = -1,
r0_t2 = -1,
r2_t1 = -1,
r1_t1 = -1,
r0_t1 = 2,
r2_t0 = -1,
r1_t0 = -1,
r0_t0 = 0]
which should be what you are looking for. Note that I elided the outputs to variables internal_choice!k which are internally generated in calls to FreshInt; you shouldn't refer to those values; they are used internally to get the random value from your Dest list.

Python percentage function not working as intended

For a college project I need to output the number of votes plus the percentage of the total votes for each team that is input (there are six in total).
I made the program using lists, and got to the part where I made a list with 7 elements: the total number of votes the program registered, followed by the votes each team got, in order.
I then use this list to run a function that changes the values at each index of the list to their percentage, with another function working as a percentage calculator. (It is called 'porcentagem'; I tested it and it works as intended.)
def porcentagem(p, w):
    pc = 100 * float(p)/float(w)
    return str(pc) + "%"
def per(list):
    listF = [0,0,0,0,0,0,0]
    for x in list[1:7]:
        if x != 0:
            listF[x] = porcentagem(x, list[0])
        else:
            listF[x] = 0
    return listF
For some reason when I input the votes, the results come all out of order. For example:
The list input is List = [6, 3, 2, 1, 0, 0, 0,] but the output is [0, '16.666666666666668%', '33.333333333333336%', '50.0%', 0, 0, 0] (Index 0 is the total if it wasn't clear, and
I have no idea what could be causing this. It is apparently changing the order of the elements (it's supposed to come out as 50%, then 33.3...%, etc.).
I'm 'new' to programming, spent two months not coding anything, English is not my first language and I'm learning Python in Portuguese, so sorry if it looks obvious lol
The x in for x in list[1:7]: returns the actual value, not the index. So x will be: 3, 2, 1, 0, 0, 0. That means the first listF[x] is listF[3] which is assigning to the 4th element.
A word of caution: list is the name of a built-in type, so using list as a variable name shadows it and can have unintended consequences. Change it to something like percentage_list.
Do something like the following:
def per(per_list):
    listF = [0,0,0,0,0,0,0]
    for i in range(1, len(per_list)):
        x = per_list[i]
        if x != 0:
            listF[i] = porcentagem(x, per_list[0])
        else:
            listF[i] = 0
    return listF
Output: [0, '50.0%', '33.333333333333336%', '16.666666666666668%', 0, 0, 0]
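As a variation on the same fix, enumerate yields the index and the value together, which makes it harder to confuse the two. This is only a sketch: the name vote_percentages and the one-decimal formatting are choices made for this example, not part of the original program.

```python
def vote_percentages(votes):
    """votes[0] is the total; votes[1:] are the per-team counts."""
    total = votes[0]
    result = [0] * len(votes)
    # start=1 keeps i aligned with the position in the full list
    for i, count in enumerate(votes[1:], start=1):
        if count != 0:
            result[i] = f"{100 * count / total:.1f}%"
    return result


print(vote_percentages([6, 3, 2, 1, 0, 0, 0]))
# [0, '50.0%', '33.3%', '16.7%', 0, 0, 0]
```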

Finding singulars/sets of local maxima/minima in a 1D-NumPy array (once again)

I would like to have a function that can detect where the local maxima/minima are in an array (even if there is a set of local maxima/minima). Example:
Given the array
test03 = np.array([2,2,10,4,4,4,5,6,7,2,6,5,5,7,7,1,1])
I would like to have an output like:
set of 2 local minima => array[0]:array[1]
set of 3 local minima => array[3]:array[5]
local minima, i = 9
set of 2 local minima => array[11]:array[12]
set of 2 local minima => array[15]:array[16]
As you can see from the example, not only are the singular values detected but, also, sets of local maxima/minima.
I know in this question there are a lot of good answers and ideas, but none of them do the job described: some of them simply ignore the extreme points of the array and all ignore the sets of local minima/maxima.
Before asking the question, I wrote a function by myself that does exactly what I described above (the function is at the end of this question: local_min(a). With the test I did, it works properly).
Question: However, I am also sure that is NOT the best way to work with Python. Are there builtin functions, APIs, libraries, etc. that I can use? Any other function suggestion? A one-line instruction? A full vectored solution?
def local_min(a):
    candidate_min=0
    for i in range(len(a)):
        # Controlling the first left element
        if i==0 and len(a)>=1:
            # If the first element is a singular local minima
            if a[0]<a[1]:
                print("local minima, i = 0")
            # If the element is a candidate to be part of a set of local minima
            elif a[0]==a[1]:
                candidate_min=1
        # Controlling the last right element
        if i == (len(a)-1) and len(a)>=1:
            if candidate_min > 0:
                if a[len(a)-1]==a[len(a)-2]:
                    print("set of " + str(candidate_min+1)+ " local minima => array["+str(i-candidate_min)+"]:array["+str(i)+"]")
            if a[len(a)-1]<a[len(a)-2]:
                print("local minima, i = " + str(len(a)-1))
        # Controlling the other values in the middle of the array
        if i>0 and i<len(a)-1 and len(a)>2:
            # If a singular local minima
            if (a[i]<a[i-1] and a[i]<a[i+1]):
                print("local minima, i = " + str(i))
                # print(str(a[i-1])+" > " + str(a[i]) + " < "+str(a[i+1])) #debug
            # If it was found a set of candidate local minima
            if candidate_min >0:
                # The candidate set IS a set of local minima
                if a[i] < a[i+1]:
                    print("set of " + str(candidate_min+1)+ " local minima => array["+str(i-candidate_min)+"]:array["+str(i)+"]")
                    candidate_min = 0
                # The candidate set IS NOT a set of local minima
                elif a[i] > a[i+1]:
                    candidate_min = 0
                # The set of local minima is growing
                elif a[i] == a[i+1]:
                    candidate_min = candidate_min + 1
                # It never should arrive in the last else
                else:
                    print("Something strange happen")
                    return -1
            # If there is a set of candidate local minima (first value found)
            if (a[i]<a[i-1] and a[i]==a[i+1]):
                candidate_min = candidate_min + 1
Note: I tried to enrich the code with some comments to make clear what I do. I know that the function I propose is not clean and just prints the results, which could instead be stored and returned at the end. It was written to give an example. The algorithm I propose should be O(n).
UPDATE:
Somebody suggested importing argrelextrema (from scipy.signal import argrelextrema) and using the function like:
def local_min_scipy(a):
    minima = argrelextrema(a, np.less_equal)[0]
    return minima
def local_max_scipy(a):
    minima = argrelextrema(a, np.greater_equal)[0]
    return minima
To have something like that is what I am really looking for. However, it doesn't work properly when the sets of local minima/maxima have more than two values. For example:
test03 = np.array([2,2,10,4,4,4,5,6,7,2,6,5,5,7,7,1,1])
print(local_max_scipy(test03))
The output is:
[ 0 2 4 8 10 13 14 16]
Of course in test03[4] I have a minimum and not a maximum. How do I fix this behavior? (I don't know if this is another question or if this is the right place where to ask it.)
A full vectored solution:
test03 = np.array([2,2,10,4,4,4,5,6,7,2,6,5,5,7,7,1,1]) # Size 17
extended = np.empty(len(test03)+2) # Rooms to manage edges, size 19
extended[1:-1] = test03
extended[0] = extended[-1] = np.inf
flag_left = extended[:-1] <= extended[1:] # Less than successor, size 18
flag_right = extended[1:] <= extended[:-1] # Less than predecessor, size 18
flagmini = flag_left[1:] & flag_right[:-1] # Local minimum, size 17
mini = np.where(flagmini)[0] # Indices of minimums
spl = np.where(np.diff(mini)>1)[0]+1 # Places to split
result = np.split(mini, spl)
result:
[0, 1] [3, 4, 5] [9] [11, 12] [15, 16]
EDIT
Unfortunately, this also detects maxima as soon as they are at least 3 items wide, since they are seen as flat local minima. Patching the numpy code above to handle this would be ugly.
To solve this problem I propose 2 other solutions, with numpy, then with numba.
With numpy using np.diff:
import numpy as np
test03=np.array([12,13,12,4,4,4,5,6,7,2,6,5,5,7,7,17,17])
extended=np.full(len(test03)+2,np.inf)
extended[1:-1]=test03
slope = np.sign(np.diff(extended)) # 1 if ascending,0 if flat, -1 if descending
not_flat,= slope.nonzero() # Indices where data is not flat.
local_min_inds, = np.where(np.diff(slope[not_flat])==2)
#local_min_inds contains indices in not_flat of beginning of local mins.
#Indices of End of local mins are shift by +1:
start = not_flat[local_min_inds]
stop = not_flat[local_min_inds+1]-1
print(*zip(start,stop))
# For the test03 from the question: (0, 1) (3, 5) (9, 9) (11, 12) (15, 16)
A direct solution compatible with numba acceleration :
#@numba.njit
def localmins(a):
    begin= np.empty(a.size//2+1,np.int32)
    end = np.empty(a.size//2+1,np.int32)
    i=k=0
    begin[k]=0
    search_end=True
    while i<a.size-1:
        if a[i]>a[i+1]:
            begin[k]=i+1
            search_end=True
        if search_end and a[i]<a[i+1]:
            end[k]=i
            k+=1
            search_end=False
        i+=1
    if search_end and i>0 : # Final plate if exists
        end[k]=i
        k+=1
    return begin[:k],end[:k]
print(*zip(*localmins(test03)))
# For the test03 from the question: (0, 1) (3, 5) (9, 9) (11, 12) (15, 16)
I think another function from scipy.signal would be interesting.
from scipy.signal import find_peaks
test03 = np.array([2,2,10,4,4,4,5,6,7,2,6,5,5,7,7,1,1])
find_peaks(test03)
Out[]: (array([ 2, 8, 10, 13], dtype=int64), {})
find_peaks has lots of options and might be quite useful, especially for noisy signals.
Update
The function is really powerful and versatile. You can set several parameters for peak minimal width, height, distance from each other and so on. As example:
test04 = np.array([1,1,5,5,5,5,5,5,5,5,1,1,1,1,1,5,5,5,1,5,1,5,1])
find_peaks(test04, width=1)
Out[]:
(array([ 5, 16, 19, 21], dtype=int64),
{'prominences': array([4., 4., 4., 4.]),
'left_bases': array([ 1, 14, 18, 20], dtype=int64),
'right_bases': array([10, 18, 20, 22], dtype=int64),
'widths': array([8., 3., 1., 1.]),
'width_heights': array([3., 3., 3., 3.]),
'left_ips': array([ 1.5, 14.5, 18.5, 20.5]),
'right_ips': array([ 9.5, 17.5, 19.5, 21.5])})
See documentation for more examples.
There can be multiple ways to solve this. One approach is listed here.
You can create a custom function, and use the maximum value to handle edge cases while finding minima.
import numpy as np
a = np.array([2,2,10,4,4,4,5,6,7,2,6,5,5,7,7,1,1])
def local_min(a):
    temp_list = list(a)
    maxval = max(a) #use max while finding minima
    temp_list = temp_list + [maxval] #handles last value edge case.
    prev = maxval #prev stores last value seen
    loc = 0 #used to store starting index of minima
    count = 0 #use to count repeated values
    #match_start = False
    matches = []
    for i in range(0, len(temp_list)): #need to check all values including the padded value
        if prev == temp_list[i]:
            if count > 0: #only increment for minima candidates
                count += 1
        elif prev > temp_list[i]:
            count = 1
            loc = i
            # match_start = True
        else: #prev < temp_list[i]
            if count > 0:
                matches.append((loc, count))
            count = 0
            loc = i
        prev = temp_list[i]
    return matches
result = local_min(a)
for match in result:
    print ("{} minima found starting at location {} and ending at location {}".format(
        match[1],
        match[0],
        match[0] + match[1] -1))
Let me know if this does the trick for you. The idea is simple, you want to iterate through the list once and keep storing minima as you see them. Handle the edges by padding with maximum values on either end. (or by padding the last end, and using the max value for initial comparison)
Here's an answer based on restriding the array into an iterable of windows:
import numpy as np
from numpy.lib.stride_tricks import as_strided
def windowstride(a, window):
    return as_strided(a, shape=(a.size - window + 1, window), strides=2*a.strides)
def local_min(a, maxwindow=None, doends=True):
    if doends: a = np.pad(a.astype(float), 1, 'constant', constant_values=np.inf)
    if maxwindow is None: maxwindow = a.size - 1
    mins = []
    for i in range(3, maxwindow + 1):
        for j,w in enumerate(windowstride(a, i)):
            if (w[0] > w[1]) and (w[-2] < w[-1]):
                if (w[1:-1]==w[1]).all():
                    mins.append((j, j + i - 2))
    mins.sort()
    return mins
Testing it out:
test03=np.array([2,2,10,4,4,4,5,6,7,2,6,5,5,7,7,1,1])
local_min(test03)
Output:
[(0, 2), (3, 6), (9, 10), (11, 13), (15, 17)]
Not the most efficient algorithm, but at least it's short. I'm pretty sure it's O(n^2), since there's roughly 1/2*(n^2 + n) windows to iterate over. This is only partially vectorized, so there may be a way to improve it.
Edit
To clarify, the output is the indices of the slices that contain the runs of local minimum values. The fact that they go one past the end of the run is intentional (someone just tried to "fix" that in an edit). You can use the output to iterate over the slices of minimum values in your input array like this:
for s in local_min(test03):
    print(test03[slice(*s)])
Output:
[2 2]
[4 4 4]
[2]
[5 5]
[1 1]
A pure numpy solution (revised answer):
import numpy as np
y = np.array([2,2,10,4,4,4,5,6,7,2,6,5,5,7,7,1,1])
x = np.r_[y[0]+1, y, y[-1]+1] # pad edges, gives possibility for minima
ups, = np.where(x[:-1] < x[1:])
downs, = np.where(x[:-1] > x[1:])
minend = ups[np.unique(np.searchsorted(ups, downs))]
minbeg = downs[::-1][np.unique(np.searchsorted(-downs[::-1], -ups[::-1]))][::-1]
minlen = minend - minbeg
for line in zip(minlen, minbeg, minend-1): print("set of %d minima %d - %d" % line)
This gives
set of 2 minima 0 - 1
set of 3 minima 3 - 5
set of 1 minima 9 - 9
set of 2 minima 11 - 12
set of 2 minima 15 - 16
np.searchsorted(ups, downs) finds the first ups after every down. This is the "true" end of a minimum.
For the start of the minima, we do it similarly, but now in reverse order.
It works for the example but is not fully tested. Still, I would say it is a good starting point.
You can use argrelmax, as long as there are no multiple consecutive equal elements, so first you need to run-length encode the array, then use argrelmax (or argrelmin):
import numpy as np
from scipy.signal import argrelmax
from itertools import groupby
def local_max_scipy(a):
    start = 0
    result = [[a[0] - 1, 0, 0]] # this is to guarantee the left edge is included
    for k, g in groupby(a):
        length = sum(1 for _ in g)
        result.append([k, start, length])
        start += length
    result.append([a[-1] - 1, 0, 0]) # this is to guarantee the right edge is included
    arr = np.array(result)
    maxima, = argrelmax(arr[:, 0])
    return arr[maxima]
test03 = np.array([2, 2, 10, 4, 4, 4, 5, 6, 7, 2, 6, 5, 5, 7, 7, 1, 1])
output = local_max_scipy(test03)
for val, start, length in output:
    print(f'set of {length} maxima start:{start} end:{start + length}')
Output
set of 1 maxima start:2 end:3
set of 1 maxima start:8 end:9
set of 1 maxima start:10 end:11
set of 2 maxima start:13 end:15
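The same run-length encoding idea can be applied to the minima from the question without SciPy. The following is a minimal sketch using only the standard library; the function name plateau_minima and the choice to treat a missing neighbour at either edge as "higher" are assumptions made for this example, picked to match the output the question asks for.

```python
from itertools import groupby


def plateau_minima(values):
    """Return (first_index, last_index) for every run of equal values
    that is lower than both neighbouring runs."""
    # Run-length encode: one (value, start, length) triple per run.
    runs = []
    start = 0
    for value, group in groupby(values):
        length = sum(1 for _ in group)
        runs.append((value, start, length))
        start += length

    minima = []
    for idx, (value, start, length) in enumerate(runs):
        # A missing neighbour (edge of the input) counts as higher.
        left_higher = idx == 0 or runs[idx - 1][0] > value
        right_higher = idx == len(runs) - 1 or runs[idx + 1][0] > value
        if left_higher and right_higher:
            minima.append((start, start + length - 1))
    return minima


test03 = [2, 2, 10, 4, 4, 4, 5, 6, 7, 2, 6, 5, 5, 7, 7, 1, 1]
print(plateau_minima(test03))
# [(0, 1), (3, 5), (9, 9), (11, 12), (15, 16)]
```

Because adjacent runs always differ in value, comparing each run with its two neighbours is enough. Note that with this edge convention a constant input is reported as one single minimum.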

How do I determine the high and low values in a series of cyclic data?

I've got some data that represents periodic motion. So, it goes from a high to a low and back again; if you were to plot it, it would look like a sine wave. However, the amplitude varies slightly in each cycle. I would like to make a list of each maximum and minimum in the entire sequence. If there were 10 complete cycles, I would end up with 20 numbers, 10 positive (high) and 10 negative (low).
It seems like this is a job for time series analysis, but I'm not familiar with statistics enough to know for sure.
I'm working in python.
Can anybody give me some guidance as far as possible code libraries and terminology?
This isn't an overly complicated problem if you didn't want to use a library, something like this should do what you want. Basically as you iterate through the data if you go from ascending to descending you have a high, and from descending to ascending you have a low.
def get_highs_and_lows(data):
    prev = data[0]
    high = []
    low = []
    asc = None
    for value in data[1:]:
        if not asc and value > prev:
            asc = True
            low.append(prev)
        elif (asc is None or asc) and value < prev:
            asc = False
            high.append(prev)
        prev = value
    if asc:
        high.append(data[-1])
    else:
        low.append(data[-1])
    return (high, low)
>>> data = [0, 1, 2, 1, 0, -2, 0, 2, 4, 2, 6, 8, 4, 0, 2, 4]
>>> print(get_highs_and_lows(data))
([2, 4, 8, 4], [0, -2, 2, 0])
You'll probably need to familiarize yourself with some of the popular python science/statistics libraries. numpy comes to mind.
And here's an item from the SciPy mailing list discussing how to do what you want using numpy.
If x is a list of your data, and you happen to know the cycle length, T, try this:
import numpy as np
# Create 10 1000-sample cycles of a noisy sine wave.
# (NumPy functions are used here in place of the old scipy.sin / scipy.randn aliases.)
T = 1000
x = np.sin(2*np.pi*np.arange(10*T)/T) + 0.1*np.random.randn(10*T)
# Find the maximum and minimum of each cycle.
[(min(x[i:i+T]), max(x[i:i+T])) for i in range(0, len(x), T)]
# prints the following:
[(-1.2234858463372265, 1.2508648231644286),
(-1.2272859833650591, 1.2339382830978067),
(-1.2348835727451217, 1.2554960382962332),
(-1.2354184224872098, 1.2305636540601534),
(-1.2367724101594981, 1.2384651681019756),
(-1.2239698560399894, 1.2665865375358363),
(-1.2211500568892304, 1.1687268390393153),
(-1.2471220836642811, 1.296787070454136),
(-1.3047322264307399, 1.1917835644190464),
(-1.3015059337968433, 1.1726658435644288)]
Note that this should work regardless of the phase offset of the sinusoid (with high probability).

Edit Distance in Python

I'm programming a spellcheck program in Python. I have a list of valid words (the dictionary) and I need to output a list of words from this dictionary that have an edit distance of 2 from a given invalid word.
I know I need to start by generating a list with an edit distance of one from the invalid word(and then run that again on all the generated words). I have three methods, inserts(...), deletions(...) and changes(...) that should output a list of words with an edit distance of 1, where inserts outputs all valid words with one more letter than the given word, deletions outputs all valid words with one less letter, and changes outputs all valid words with one different letter.
I've checked a bunch of places but I can't seem to find an algorithm that describes this process. All the ideas I've come up with involve looping through the dictionary list multiple times, which would be extremely time consuming. If anyone could offer some insight, I'd be extremely grateful.
The thing you are looking at is called an edit distance and here is a nice explanation on wiki. There are a lot of ways how to define a distance between the two words and the one that you want is called Levenshtein distance and here is a DP (dynamic programming) implementation in python.
def levenshteinDistance(s1, s2):
    if len(s1) > len(s2):
        s1, s2 = s2, s1
    distances = range(len(s1) + 1)
    for i2, c2 in enumerate(s2):
        distances_ = [i2+1]
        for i1, c1 in enumerate(s1):
            if c1 == c2:
                distances_.append(distances[i1])
            else:
                distances_.append(1 + min((distances[i1], distances[i1 + 1], distances_[-1])))
        distances = distances_
    return distances[-1]
And a couple of more implementations are here.
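To connect this back to the question, one straightforward (if not the fastest) approach is to compute the distance from the misspelled word to every dictionary word and keep those within the limit. The sketch below is self-contained; the names levenshtein and words_within, and the choice of "at most max_distance" rather than "exactly max_distance", are assumptions made for this example.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance, keeping only two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def words_within(word, dictionary, max_distance=2):
    """Return the dictionary words whose distance to word is <= max_distance."""
    return [w for w in dictionary if levenshtein(word, w) <= max_distance]


print(words_within("appel", ["ape", "apple", "peach", "puppy"]))
# ['ape', 'apple']
```

This makes a single pass over the dictionary, at the cost of one distance computation per word.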
difflib in the standard library has various utilities for sequence matching, including the get_close_matches method that you could use. It uses an algorithm adapted from Ratcliff and Obershelp.
From the docs
>>> from difflib import get_close_matches
>>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
['apple', 'ape']
Here is my version for Levenshtein distance
def edit_distance(s1, s2):
    m=len(s1)+1
    n=len(s2)+1
    tbl = {}
    for i in range(m): tbl[i,0]=i
    for j in range(n): tbl[0,j]=j
    for i in range(1, m):
        for j in range(1, n):
            cost = 0 if s1[i-1] == s2[j-1] else 1
            tbl[i,j] = min(tbl[i, j-1]+1, tbl[i-1, j]+1, tbl[i-1, j-1]+cost)
    return tbl[i,j]
print(edit_distance("Helloworld", "HalloWorld"))
#this calculates edit distance not levenstein edit distance
word1="rice"
word2="ice"
len_1=len(word1)
len_2=len(word2)
x =[[0]*(len_2+1) for _ in range(len_1+1)]#the matrix whose last element ->edit distance
for i in range(0,len_1+1): #initialization of base case values
    x[i][0]=i
for j in range(0,len_2+1):
    x[0][j]=j
for i in range (1,len_1+1):
    for j in range(1,len_2+1):
        if word1[i-1]==word2[j-1]:
            x[i][j] = x[i-1][j-1]
        else :
            x[i][j]= min(x[i][j-1],x[i-1][j],x[i-1][j-1])+1
print(x[i][j])
Using the SequenceMatcher from Python built-in difflib is another way of doing it, but (as correctly pointed out in the comments), the result does not match the definition of an edit distance exactly. Bonus: it supports ignoring "junk" parts (e.g. spaces or punctuation).
from difflib import SequenceMatcher
a = 'kitten'
b = 'sitting'
required_edits = [
    code
    for code in (
        SequenceMatcher(a=a, b=b, autojunk=False)
        .get_opcodes()
    )
    if code[0] != 'equal'
]
required_edits
# [
# # (tag, i1, i2, j1, j2)
# ('replace', 0, 1, 0, 1), # replace a[0:1]="k" with b[0:1]="s"
# ('replace', 4, 5, 4, 5), # replace a[4:5]="e" with b[4:5]="i"
# ('insert', 6, 6, 6, 7), # insert b[6:7]="g" after a[6:6]="n"
# ]
# the edit distance:
len(required_edits) # == 3
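To illustrate why the opcode count does not always match an edit distance: a single opcode can cover several characters. The sketch below compares the plain count with a sum over the span lengths. The helper names are made up for this example, and the span total should be read as an upper bound on the number of single-character edits, not as the Levenshtein distance itself.

```python
from difflib import SequenceMatcher


def non_equal_opcodes(a, b):
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    return [op for op in matcher.get_opcodes() if op[0] != 'equal']


def opcode_count(a, b):
    # Number of replace/insert/delete opcodes, as in the answer above.
    return len(non_equal_opcodes(a, b))


def opcode_span_total(a, b):
    # Each opcode (tag, i1, i2, j1, j2) covers a[i1:i2] and b[j1:j2]; turning one
    # span into the other needs at most max(length) single-character edits.
    return sum(max(i2 - i1, j2 - j1)
               for _, i1, i2, j1, j2 in non_equal_opcodes(a, b))


print(opcode_count('abc', 'xyz'))       # 1, although three substitutions are needed
print(opcode_span_total('abc', 'xyz'))  # 3
```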
I would recommend not creating this kind of code on your own. There are libraries for that.
For instance the Levenshtein library.
In [2]: Levenshtein.distance("foo", "foobar")
Out[2]: 3
In [3]: Levenshtein.distance("barfoo", "foobar")
Out[3]: 6
In [4]: Levenshtein.distance("Buroucrazy", "Bureaucracy")
Out[4]: 3
In [5]: Levenshtein.distance("Misisipi", "Mississippi")
Out[5]: 3
In [6]: Levenshtein.distance("Misisipi", "Misty Mountains")
Out[6]: 11
In [7]: Levenshtein.distance("Buroucrazy", "Born Crazy")
Out[7]: 4
Similar to Santoshi's solution above but I made three changes:
One line initialization instead of five
No need to define cost alone (just use int(boolean) 0 or 1)
Instead of double for loop use product, (this last one is only cosmetic, double loop seems unavoidable)
from itertools import product
def edit_distance(s1,s2):
    d={ **{(i,0):i for i in range(len(s1)+1)},**{(0,j):j for j in range(len(s2)+1)}}
    for i, j in product(range(1,len(s1)+1), range(1,len(s2)+1)):
        d[i,j]=min((s1[i-1]!=s2[j-1]) + d[i-1,j-1], d[i-1,j]+1, d[i,j-1]+1)
    # index by the lengths so that empty strings work too (i and j are unset if the loop never runs)
    return d[len(s1),len(s2)]
Instead of going with the Levenshtein distance algorithm alone, use a BK-tree or a trie, as these have lower complexity than a plain edit-distance scan. Reading up on these topics will give a detailed description.
This link will help you more about spell checking.
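As a rough illustration of the BK-tree idea, here is a minimal sketch. It still relies on an edit distance as its metric, but uses the triangle inequality to skip subtrees that cannot contain a match, so fewer distance computations are needed per lookup. The class layout and all names are assumptions made for this example, not taken from any particular library.

```python
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


class BKTree:
    def __init__(self, distance):
        self.distance = distance
        self.root = None  # a node is (word, {distance_to_child: child_node})

    def add(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = self.distance(word, node[0])
            if d == 0:
                return  # word already stored
            child = node[1].get(d)
            if child is None:
                node[1][d] = (word, {})
                return
            node = child

    def search(self, word, max_distance):
        results = []
        stack = [self.root] if self.root is not None else []
        while stack:
            candidate, children = stack.pop()
            d = self.distance(word, candidate)
            if d <= max_distance:
                results.append((d, candidate))
            # Triangle inequality: matches can only live in subtrees whose
            # edge label lies in [d - max_distance, d + max_distance].
            for label, child in children.items():
                if d - max_distance <= label <= d + max_distance:
                    stack.append(child)
        return sorted(results)


tree = BKTree(levenshtein)
for w in ["ape", "apple", "peach", "puppy"]:
    tree.add(w)
print(tree.search("appel", 2))  # [(2, 'ape'), (2, 'apple')]
```

The tree is built once from the dictionary and can then be queried for each misspelled word.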
You need Minimum Edit Distance for this task.
Following is my version of MED a.k.a Levenshtein Distance.
def MED_character(str1,str2):
    cost=0
    len1=len(str1)
    len2=len(str2)
    #output the length of other string in case the length of any of the string is zero
    if len1==0:
        return len2
    if len2==0:
        return len1
    #initializing a zero matrix with one extra row and column for the empty prefix
    accumulator = [[0 for x in range(len2+1)] for y in range(len1+1)]
    # initializing the base cases
    for i in range(0,len1+1):
        accumulator[i][0] = i
    for i in range(0,len2+1):
        accumulator[0][i] = i
    # we take the accumulator and iterate through it row by row.
    for i in range(1,len1+1):
        char1=str1[i-1]
        for j in range(1,len2+1):
            char2=str2[j-1]
            cost1=0
            if char1!=char2:
                cost1=2 #cost for substitution (use 1 for the standard Levenshtein distance)
            accumulator[i][j]=min(accumulator[i-1][j]+1, accumulator[i][j-1]+1, accumulator[i-1][j-1] + cost1 )
    cost=accumulator[len1][len2]
    return cost
Fine-tuned code based on the version from @Santosh, which should address the issue brought up by @Artur Krajewski. The biggest difference is replacing the dictionary with an actual 2D matrix.
def edit_distance(s1, s2):
    # add a blank character for both strings
    m=len(s1)+1
    n=len(s2)+1
    # launch a matrix
    tbl = [[0] * n for i in range(m)]
    for i in range(m): tbl[i][0]=i
    for j in range(n): tbl[0][j]=j
    for i in range(1, m):
        for j in range(1, n):
            #if strings have same letters, set operation cost as 0 otherwise 1
            cost = 0 if s1[i-1] == s2[j-1] else 1
            #find min practice
            tbl[i][j] = min(tbl[i][j-1]+1, tbl[i-1][j]+1, tbl[i-1][j-1]+cost)
    # the bottom-right cell holds the distance between the full strings
    return tbl[m-1][n-1]
edit_distance("birthday", "Birthdayyy")
Following up on @krassowski's answer:
from difflib import SequenceMatcher
def sequence_matcher_edits(word_a, word_b):
    required_edits = [code for code in (
        SequenceMatcher(a=word_a, b=word_b, autojunk=False).get_opcodes()
    )
        if code[0] != 'equal'
    ]
    return len(required_edits)
print(f"sequence_matcher_edits {sequence_matcher_edits('kitten', 'sitting')}")
# -> sequence_matcher_edits 3
