Inbuilt Python Function for String Comparison like N-gram

Is there any built-in function in Python that performs a string comparison like Ngram.Compare('text','text2')? I don't want to install the N-gram module. I tried all the public and private methods I found by running dir('text').
I want to get a percentage match when comparing two strings.

You want the Levenshtein distance which is implemented through
http://pypi.python.org/pypi/python-Levenshtein/
Not wanting to install something means: you have to write the code yourself.
http://en.wikipedia.org/wiki/Levenshtein_distance
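As a rough illustration of "writing the code yourself", a character n-gram similarity can be sketched with the standard library alone. The names ngrams and ngram_similarity, the space padding and the Jaccard-style score are assumptions made for this sketch; they are not the scoring used by the ngram module.

```python
def ngrams(text, n=3):
    """Return the set of character n-grams of text, padded with spaces."""
    padded = " " * (n - 1) + text + " " * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard similarity of the two n-gram sets, between 0.0 and 1.0."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    shared = grams_a & grams_b
    return len(shared) / len(grams_a | grams_b)


print(ngram_similarity("text", "text2"))
```

Multiplying the result by 100 gives a percentage. Other normalizations are possible, so the numbers will not necessarily agree with any particular n-gram package.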

Use difflib in the standard library.
You can also compute a Levenshtein distance:
def lev(seq1, seq2):
    oneago = None
    thisrow = range(1, len(seq2) + 1) + [0]
    for x in xrange(len(seq1)):
        twoago, oneago, thisrow = oneago, thisrow, [0] * len(seq2) + [x + 1]
        for y in xrange(len(seq2)):
            delcost = oneago[y] + 1
            addcost = thisrow[y - 1] + 1
            subcost = oneago[y - 1] + (seq1[x] != seq2[y])
            thisrow[y] = min(delcost, addcost, subcost)
    return thisrow[len(seq2) - 1]
def di(seq1,seq2):
    return float(lev(seq1,seq2))/min(len(seq1),len(seq2))
print lev('spa','spam')
print di('spa','spam')
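For the difflib route, a minimal sketch of a percentage match might look like the following. The helper name percent_match is an assumption for illustration; SequenceMatcher and its ratio() method are part of the standard library.

```python
from difflib import SequenceMatcher


def percent_match(a, b):
    """Return the similarity of two strings as a percentage (0 to 100)."""
    return SequenceMatcher(None, a, b).ratio() * 100


print(round(percent_match("text", "text2"), 1))
```

ratio() is defined as twice the number of matching characters divided by the total length of both strings, so identical strings give 100 and strings with nothing in common give 0.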

Related

Replacing Sympy indexed symbols with numeric values

I have a sympy expression I want to put numerical values in after differentiating it. The variables I want to replace are all the x[i], y[i] and R_abs[i] in the last expression and are numpy arrays a la
rx=np.array([-0.357, -0.742, -1.078, 0.206])
But trying subs or replace either does nothing or raises an error saying that Symbol objects do not support indexing, for example for e1.subs(x[1],rx[0]). I went through every variation I could think of, to no avail.
import sympy as sp
r0,ge_x,ge_y,bx,by = sp.symbols('r0,ge_x,ge_y,bx,by', real=True) #Main symbols
i,x,y,R_abs = sp.symbols('i,x,y,R_abs', real=True) #Helper symbols
n=4
s2=sp.Sum((bx+r0*sp.Indexed('x',i)/sp.Indexed('R_abs',i)+ge_x*sp.Indexed('x',i)+ge_y*sp.Indexed('y',i)-sp.Indexed('x',i))**2+(by+r0*sp.Indexed('y',i)/sp.Indexed('R_abs',i)-ge_x*sp.Indexed('y',i)+ge_y*sp.Indexed('x',i)-sp.Indexed('y',i))**2,(i,1,n))
e1=sp.Eq(sp.diff(s2,bx).doit(),0)
With e1 then being
Eq(8*bx + 2*ge_x*x[1] + 2*ge_x*x[2] + 2*ge_x*x[3] + 2*ge_x*x[4] + 2*ge_y*y[1] + 2*ge_y*y[2] + 2*ge_y*y[3] + 2*ge_y*y[4] + 2*r0*x[4]/R_abs[4] + 2*r0*x[3]/R_abs[3] + 2*r0*x[2]/R_abs[2] + 2*r0*x[1]/R_abs[1] - 2*x[1] - 2*x[2] - 2*x[3] - 2*x[4], 0)
In here I would like to replace all the x, y, and R_abs with their numerical values.

I've always struggled with indexing in SymPy. It turns out that using Function instances is much easier than indexing instances of Symbol. It also makes the notation simpler.
Also note that when you use strings in your expression, I think SymPy creates its own symbols with those names, and they cannot be accessed through yours because your symbols are different objects. At least that's what sometimes happens to me.
Here is a working sample:
import sympy as sp
r0, ge_x, ge_y, bx, by = sp.symbols("r0 ge_x ge_y bx by", real=True) # main symbols
# define functions that will take the role of indexed symbols
x = sp.Function("x")
y = sp.Function("y")
R_abs = sp.Function("R_abs")
i = sp.Symbol("i", positive=True, integer=True)
n = 4
s2 = sp.Sum((bx + r0 * x(i) / R_abs(i) + ge_x * x(i) + ge_y * y(i) - x(i)) ** 2 +
            (by + r0 * y(i) / R_abs(i) - ge_x * y(i) + ge_y * x(i) - y(i)) ** 2, (i, 1, n))
s2_prime = sp.diff(s2, bx).doit().simplify()
print(s2_prime)
# whatever lists you want. Can even be an instance of `np.ndarray`
# note that you summed from 1 to n so the 0th element will not be used
x_array = [0, 1, 2, 3, 4]
y_array = [4, 3, 2, 1, 0]
R_abs_array = [-10, 10, 5, 4, 3]
# define a function to access these array elements
x_function = lambda index: x_array[index]
y_function = lambda index: y_array[index]
R_abs_function = lambda index: R_abs_array[index]
# no idea why subs does not work and you MUST keep the same name for the variable.
# you can't have for example `evaluated_s2_prime = ...`.
# Probably something to do with forcing sp to remove references to `x`?
s2_prime = s2_prime.replace(x, x_function).replace(y, y_function).replace(R_abs, R_abs_function)
print(s2_prime)
Producing:
8*bx + 2*ge_x*x(1) + 2*ge_x*x(2) + 2*ge_x*x(3) + 2*ge_x*x(4) + 2*ge_y*y(1) + 2*ge_y*y(2) + 2*ge_y*y(3) + 2*ge_y*y(4) + 2*r0*x(4)/R_abs(4) + 2*r0*x(3)/R_abs(3) + 2*r0*x(2)/R_abs(2) + 2*r0*x(1)/R_abs(1) - 2*x(1) - 2*x(2) - 2*x(3) - 2*x(4)
8*bx + 20*ge_x + 12*ge_y + 31*r0/6 - 20
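If you would rather keep the indexed notation from the question, one possible sketch is to create the base with sp.IndexedBase, so that x[1] is a valid expression, and pass a dictionary to subs. The sum below is a simplified stand-in for the one in the question, and the variable names are assumptions made for illustration.

```python
import sympy as sp

bx = sp.Symbol("bx", real=True)
i = sp.Symbol("i", integer=True)
x = sp.IndexedBase("x")  # x[i] now builds Indexed objects

n = 4
# Simplified stand-in for the sum in the question
s = sp.Sum((bx - x[i]) ** 2, (i, 1, n))
d = sp.diff(s, bx).doit()

rx = [-0.357, -0.742, -1.078, 0.206]  # sample data from the question
# Map x[1]..x[n] to rx[0]..rx[n-1] and substitute all at once
values = {x[k + 1]: rx[k] for k in range(n)}
d_num = d.subs(values)
print(d_num)
```

The original attempt failed at x[1] because x was a plain Symbol, which cannot be indexed. With an IndexedBase, the keys of the dictionary are the same kind of objects that appear in the differentiated expression.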

python: changing symbol variable and assign numerical value

To calculate derivatives and other expressions I used the sympy package and defined T = sy.Symbol('T'). I have now calculated the right expression:
E= -T**2*F_deriv_T(T,rho)
where
def F_deriv_rho(T,rho):
    ret = 0
    for n in range(5):
        for m in range(4):
            inner= c[n,m]*g_rho_deriv_rho_np*g_T_np
            ret += inner
    return ret
that looks like this:
F_deriv_rho: [0.0 7.76971e-5*T 0.0001553942*T**2*rho
T*(-5.14488e-5*log(rho) - 5.14488e-5)*log(T) + T*(1.22574e-5*log(rho)+1.22574e-5)*log(T) + T*(1.89488e-5*log(rho) + 1.89488e-5)*log(T) + T*(2.29441e-5*log(rho) + 2.29441e-5)*log(T) + T*(7.49956e-5*log(rho) + 7.49956e-5)*log(T)
T**2*(-0.0001028976*rho*log(rho) - 5.14488e-5*rho)*log(T) + T**2*(2.45148e-5*rho*log(rho) + 1.22574e-5*rho)*log(T) + T**2*(3.78976e-5*rho*log(rho) + 1.89488e-5*rho)*log(T) + T**2*(4.58882e-5*rho*log(rho) + 2.29441e-5*rho)*log(T) + T**2*(0.0001499912*rho*log(rho) + 7.49956e-5*rho)*log(T)]
With Python, I would like to replace the symbol T (and rho) with a value. How could I do that?
So I would like to create 10 numbers with T_def = np.arange(2000, 10000, 800) and replace every sy.Symbol('T') in turn with each of the 10 values in that array.
Thanks for your help

I found the solution in this post:
How to substitute multiple symbols in an expression in sympy?
by using "subs":
>>> from sympy import symbols
>>> x, y = symbols('x y')
>>> f = x + y
>>> f.subs({x: 10, y: 20})
30
There's more on this kind of thing here: http://docs.sympy.org/latest/tutorial/basic_operations.html
EDIT: A faster way would be to use "lambdify", as suggested by @Bjoern Dahlgren
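A minimal sketch of both approaches applied to a grid of temperatures could look like this. The expression is a hypothetical stand-in for the much longer one in the question, and a plain range is used in place of np.arange so that the example only needs sympy.

```python
import sympy as sy

T, rho = sy.symbols("T rho", positive=True)
# Hypothetical stand-in for the much longer expression in the question
expr = -T**2 * rho * sy.log(T)

# subs: substitute one pair of values at a time
single = expr.subs({T: 2000, rho: 1})

# lambdify: compile once, then evaluate for many values of T
f = sy.lambdify((T, rho), expr, "math")
T_def = range(2000, 10000, 800)  # same grid as np.arange(2000, 10000, 800)
results = [f(t, 1.0) for t in T_def]
print(results[0])
```

subs returns a new SymPy expression each time, while the function produced by lambdify returns ordinary Python floats, which is usually what you want inside a numerical loop.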

How to search for discontinuous characters in Python list?

I am trying to search a list (DB) for possible matches of fragments of text. For instance, I have a DB with the text "evilman". I want to use user input to search for any possible matches in the DB and report each match with a confidence. If the user inputs "hello", then there are no possible matches. If the user inputs "evil", then the possible match is evilman with a confidence of 57% (4 out of 7 characters match), and so on.
However, I also want a way to match input text such as "evxxman". 5 out of 7 characters of evxxman match the text "evilman" in the DB, but a simple check in Python will say there is no match, since it only finds text that matches consecutively. I hope this makes sense. Thanks
Following is my code:
db = []
possible_signs = []
count = 0
db.append("evilman")
text = raw_input()
for s in db:
    if text in s:
        if len(text) >= len(s)/2:
            possible_signs.append(s)
            count += 1
            confidence = (float(len(text)) / float(len(s))) * 100
            print "Confidence:", '%.2f' %(confidence), "<possible match:>", possible_signs[0]

This first version seems to match your examples. It makes the strings "slide" against each other and counts the number of identical characters at each offset.
The ratio is the character count divided by the length of the reference string. Take the maximum over all offsets and voila.
Call it for each string in your DB.
def commonChars(txt, ref):
    txtLen = len(txt)
    refLen = len(ref)
    r = 0
    for i in range(refLen + (txtLen - 1)):
        rStart = abs(min(0, txtLen - i - 1))
        tStart = txtLen -i - 1 if i < txtLen else 0
        l = min(txtLen - tStart, refLen - rStart)
        c = 0
        for j in range(l):
            if txt[tStart + j] == ref[rStart + j]:
                c += 1
        r = max(r, c / refLen)
    return r
print(commonChars('evxxman', 'evilman')) # 0.7142857142857143
print(commonChars('evil', 'evilman')) # 0.5714285714285714
print(commonChars('man', 'evilman')) # 0.42857142857142855
print(commonChars('batman', 'evilman')) # 0.42857142857142855
print(commonChars('batman', 'man')) # 1.0
This second version produces the same results, but uses the difflib module mentioned in other answers.
It computes the matching blocks, sums their lengths, and divides the total by the reference length.
import difflib
def commonBlocks(txt, ref):
    matcher = difflib.SequenceMatcher(a=txt, b=ref)
    matchingBlocks = matcher.get_matching_blocks()
    matchingCount = sum([b.size for b in matchingBlocks])
    return matchingCount / len(ref)
print(commonBlocks('evxxman', 'evilman')) # 0.7142857142857143
print(commonBlocks('evxxxxman', 'evilman')) # 0.7142857142857143
As the calls above show, the behavior is slightly different: "holes" between matching blocks are ignored and do not change the final ratio.

To find matches with a quality estimate, have a look at difflib.SequenceMatcher.ratio and friends. These functions might not be the fastest match checkers, but they are easy to use.
Example copied from the difflib docs:
>>> from difflib import SequenceMatcher
>>> s = SequenceMatcher(None, "abcd", "bcde")
>>> s.ratio()
0.75
>>> s.quick_ratio()
0.75
>>> s.real_quick_ratio()
1.0
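To search a whole DB list rather than compare a single pair, a sketch along these lines could combine difflib.get_close_matches with ratio(). The helper name best_matches, the sample db contents and the cutoff of 0.6 (the library default) are assumptions for illustration.

```python
import difflib

db = ["evilman", "batman", "superman"]


def best_matches(text, candidates, cutoff=0.6):
    """Return (candidate, percent) pairs whose ratio is at least cutoff."""
    hits = difflib.get_close_matches(text, candidates, n=3, cutoff=cutoff)
    return [(c, difflib.SequenceMatcher(None, text, c).ratio() * 100)
            for c in hits]


print(best_matches("evxxman", db))
print(best_matches("hello", db))
```

Note that ratio() normalizes by the combined length of both strings, so the percentages differ from the "matching characters out of the reference length" scheme in the question.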

Based on your description and examples, it seems to me that you're actually looking for something like the Levenshtein (or edit) distance. Note that it does not quite give the scores you specify, but I think it gives the scores you actually want.
There are several packages implementing this efficiently, e.g., distance:
In [1]: import distance
In [2]: distance.levenshtein('evilman', 'hello')
Out[2]: 6L
In [3]: distance.levenshtein('evilman', 'evil')
Out[3]: 3L
In [4]: distance.levenshtein('evilman', 'evxxman')
Out[4]: 2L
Note that the library contains several measures of similarity, e.g., jaccard and sorensen return a normalized value per default:
>>> distance.sorensen("decide", "resize")
0.5555555555555556
>>> distance.jaccard("decide", "resize")
0.7142857142857143
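If installing a package is not an option, the edit distance and a confidence derived from it can be sketched in plain Python. The confidence function and its normalization by the longer string are assumptions made for this sketch; other normalizations are equally defensible.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance, one row at a time."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]


def confidence(text, ref):
    """Map the distance to a 0..1 score (hypothetical normalization)."""
    longest = max(len(text), len(ref))
    return 1.0 if longest == 0 else 1 - levenshtein(text, ref) / longest


print(confidence("evxxman", "evilman"))
```

With this normalization, "evxxman" against "evilman" scores 5/7, which matches the 5 out of 7 characters expected in the question.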

Create a while loop and track two indices, one for your key word ("evil") and one for your query word ("evilman"). Here is a sketch in Python:
key = "evil"
query = "evilman"
key_index = 0
query_index = 0
confidence_score = 0
# Walk through the query, advancing in the key only when characters match
while key_index < len(key) and query_index < len(query):
    if key[key_index] == query[query_index]:
        confidence_score += 1
        key_index += 1
    query_index += 1
# If we didn't reach the end of the key, it is not a subsequence of the query
if key_index != len(key):
    confidence_score = 0
print("Confidence: " + str(confidence_score) + " out of " + str(len(query)))

Find length of a string that includes its own length?

I want to get the length of a string including a part of the string that represents its own length without padding or using structs or anything like that that forces fixed lengths.
So for example I want to be able to take this string as input:
"A string|"
And return this:
"A string|11"

On the basis of the OP tolerating such an approach (and to provide an implementation technique for the eventual python answer), here's a solution in Java.
final String s = "A String|";
int n = s.length(); // `length()` returns the length of the string.
String t; // the result
do {
    t = s + n; // append the stringified n to the original string
    if (n == t.length()){
        return t; // string length no longer changing; we're good.
    }
    n = t.length(); // n must hold the total length
} while (true); // round again
The problem, of course, is that appending n changes the string length. But luckily, the length only ever increases or stays the same, so it converges very quickly, because the number of digits in n grows only logarithmically. In this particular case, the attempted values of n are 9, 10, and 11. And that's a pernicious case.

A simple solution is :
def addlength(string):
    n1=len(string)
    n2=len(str(n1))+n1
    n2 += len(str(n2))-len(str(n1)) # a carry can arise
    return string+str(n2)
Since a possible carry will increase the length by at most one unit.
Examples :
In [2]: addlength('a'*8)
Out[2]: 'aaaaaaaa9'
In [3]: addlength('a'*9)
Out[3]: 'aaaaaaaaa11'
In [4]: addlength('a'*99)
Out[4]: 'aaaaa...aaa102'
In [5]: addlength('a'*999)
Out[5]: 'aaaa...aaa1003'

Here is a simple python port of Bathsheba's answer :
def str_len(s):
    n = len(s)
    t = ''
    while True:
        t = s + str(n)
        if n == len(t):
            return t
        n = len(t)
This is a much more clever and simple way than anything I was thinking of trying!
Suppose you had s = 'abcdefgh|'. On the first pass through, t = 'abcdefgh|9'.
Since n != len(t) (which is now 10), it goes through again: t = 'abcdefgh|' + str(n) with str(n) = '10', so you have 'abcdefgh|10', which is still not quite right. Now n = len(t) = 11, and on the next pass you get 'abcdefgh|11', which is correct. Pretty clever solution!

It is a tricky one, but I think I've figured it out.
Done in a hurry in Python 2.7, please fully test. This should handle input strings of up to 996 characters (beyond that the appended length needs four digits):
import sys
orig = sys.argv[1]
origLen = len(orig)
if (origLen >= 98):
    extra = str(origLen + 3)
elif (origLen >= 8):
    extra = str(origLen + 2)
else:
    extra = str(origLen + 1)
final = orig + extra
print final
Results of very brief testing
C:\Users\PH\Desktop>python test.py "tiny|"
tiny|6
C:\Users\PH\Desktop>python test.py "myString|"
myString|11
C:\Users\PH\Desktop>python test.py "myStringWith98Characters.........................................................................|"
myStringWith98Characters.........................................................................|101

Just find the length of the string. Then iterate through each value of the number of digits the length of the resulting string can possibly have. While iterating, check if the sum of the number of digits to be appended and the initial string length is equal to the length of the resulting string.
def get_length(s):
    s = s + "|"
    result = ""
    len_s = len(s)
    i = 1
    while True:
        candidate = len_s + i
        if len(str(candidate)) == i:
            result = s + str(len_s + i)
            break
        i += 1
    return result

This code gives the result.
I used a few variables, but in the end it produces the output you want:
def len_s(s):
    s = s + '|'
    b = len(s)
    z = s + str(b)
    length = len(z)
    new_s = s + str(length)
    new_len = len(new_s)
    return s + str(new_len)
s = "A string"
print len_s(s)

Here's a direct equation for this (so it's not necessary to construct the string). If s is the string, then the length of the string including the length of the appended length will be:
L1 = len(s) + 1 + int(log10(len(s) + 1 + int(log10(len(s)))))
The idea here is that a direct calculation is only problematic when the appended length will push the length past a power of ten; that is, at 9, 98, 99, 997, 998, 999, 9996, etc. To work this through, 1 + int(log10(len(s))) is the number of digits in the length of s. If we add that to len(s), then 9->10, 98->100, 99->101, etc, but still 8->9, 97->99, etc, so we can push past the power of ten exactly as needed. That is, adding this produces a number with the correct number of digits after the addition. Then do the log again to find the length of that number and that's the answer.
To test this:
from math import log10
def find_length(s):
    L1 = len(s) + 1 + int(log10(len(s) + 1 + int(log10(len(s)))))
    return L1
# test, just looking at lengths around 10**n
for i in range(9):
    for j in range(30):
        L = abs(10**i - j + 10) + 1
        s = "a"*L
        x0 = find_length(s)
        new0 = s+`x0`
        if len(new0)!=x0:
            print "error", len(s), x0, log10(len(s)), log10(x0)
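The same two-step idea can also be sketched with integer arithmetic only, counting digits with len(str(...)) instead of log10, which sidesteps floating-point rounding and the empty-string case. The names self_length and add_length are assumptions for this sketch.

```python
def self_length(n):
    """Total length L of an n-character string plus its own length,
    i.e. the L that satisfies L == n + len(str(L))."""
    guess = n + len(str(n))       # digits of n as a first estimate
    return n + len(str(guess))    # recount digits in case of a carry


def add_length(s):
    """Append the self-inclusive length to s."""
    return s + str(self_length(len(s)))


print(add_length("A string|"))  # A string|11
```

The first estimate can be off by at most one digit, and recounting the digits of that estimate absorbs the carry, which is the same argument as in the formula above.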

Damerau-Levenshtein distance code throwing errors?

For some reason, when I try and implement the following code (I'm using Sublime Text 2) it gives me the error "Invalid Syntax" on line 18. I'm not sure why this is, I found the code here and it apparently should work, so I have no idea why it doesn't. Any tips?
Here is the code:
def damerau_levenshtein_distance(word1, word2):
    distances = {}
    len_word1 = len(word1)
    len_word2 = len(word2)
    for i in xrange(-1, (len_word1 + 1)):
        distances[(i,-1)] = i + 1
    for j in xrange(-1, (len_word2 + 1)):
        distances[(-1,j)] = j + 1
    for i in xrange(len_word1):
        for j in xrange(len_word2):
            if word1[i] == word2[j]:
                distance_total = 0
            else:
                distance_total = 1
            distances[(i, j)] = min(
                distances[(i-1,j)] + 1, # deletion
                distances[(i,j-1)] + 1 # insertion
                distances[(i-1,j-1)] + distance_total #substitution
            )
            if i and j and word1[i] == word2[j-1] and word1[i-1] == word2[j]:
                distances[(i,j)] = min(distances[(i,j)], distances[i-2,j-2] + distance_total) # transposition
    return distances[len_word1-1,len_word2-1]

There is a missing comma at the end of the insertion line. It should be:
distances[(i,j-1)] + 1, # insertion

Looks like you've fixed this issue, but if you don't want to implement all of these yourself, you can use the jellyfish package found in pypi: https://pypi.python.org/pypi/jellyfish. I've used it to great success in the past.
It contains several distance functions, including Damerau-Levenshtein distances.
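For reference, a self-contained Python 3 sketch of the restricted variant (often called optimal string alignment, where each transposed pair may not be edited again) is shown below. The function name osa_distance and the list-of-lists table are choices made for this sketch, not the API of any package.

```python
def osa_distance(a, b):
    """Edit distance allowing insert, delete, substitute and adjacent swap."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
            # adjacent transposition, e.g. "ab" -> "ba"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("ab", "ba"))
```

The unrestricted Damerau-Levenshtein distance can be smaller for some inputs (for example "ca" and "abc"), so results may differ from libraries that implement that variant.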
