Finding if two strings are almost similar

I want to find out if two strings are almost similar. For example, a string like 'Mohan Mehta' should match 'Mohan Mehte' and vice versa. As another example, 'Umesh Gupta' should match 'Umash Gupte'.
Basically, one string is correct and the other is a misspelling of it. All my strings are names of people.
Any suggestions on how to achieve this?
The solution does not have to be 100 percent effective.

You can use difflib.SequenceMatcher if you want something from the stdlib:
from difflib import SequenceMatcher
s_1 = 'Mohan Mehta'
s_2 = 'Mohan Mehte'
print(SequenceMatcher(a=s_1,b=s_2).ratio())
0.909090909091
fuzzywuzzy is one of numerous libs that you can install; it uses the difflib module and, optionally, python-Levenshtein. You should also check out the Wikipedia page on approximate string matching.
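To turn the ratio into a yes/no answer, one option is to compare it against a cutoff. The following is only a minimal sketch: the helper name is_similar and the 0.8 threshold are assumptions made for illustration, not part of difflib, and the cutoff would need tuning on real data.

```python
from difflib import SequenceMatcher


def is_similar(a, b, threshold=0.8):
    """Return True when the SequenceMatcher ratio reaches the threshold.

    The function name and the default threshold are hypothetical;
    tune the cutoff against your own list of names.
    """
    # casefold() makes the comparison case-insensitive.
    ratio = SequenceMatcher(None, a.casefold(), b.casefold()).ratio()
    return ratio >= threshold


print(is_similar('Mohan Mehta', 'Mohan Mehte'))  # True
print(is_similar('Umesh Gupta', 'Umash Gupte'))  # True
print(is_similar('Mohan Mehta', 'Umesh Gupta'))  # False
```

A lower threshold tolerates more typos but also lets more unrelated names through.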

Another approach is to use a "phonetic algorithm":
A phonetic algorithm is an algorithm for indexing of words by their pronunciation.
For example using the soundex algorithm:
>>> import soundex
>>> s = soundex.getInstance()
>>> s.soundex("Umesh Gupta")
'U5213'
>>> s.soundex("Umash Gupte")
'U5213'
>>> s.soundex("Umesh Gupta") == s.soundex("Umash Gupte")
True
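If installing a third-party module is not an option, the idea can be sketched in plain Python. The code below is an assumption-laden illustration of the classic four-character American Soundex, applied word by word; it is not the implementation used by the soundex module above, so its codes differ from the 'U5213' shown there. The names soundex and names_sound_alike are made up for this example.

```python
# Consonant groups of the classic American Soundex scheme.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word):
    """Return the 4-character Soundex code of a single word."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    result = [letters[0]]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with equal codes
        code = _CODES.get(ch, "")
        if code and code != prev:
            result.append(code)
        prev = code  # a vowel resets prev, so a repeated code counts again
    return ("".join(result) + "000")[:4]  # pad with zeros, keep 4 chars


def names_sound_alike(a, b):
    """Compare two full names word by word using their Soundex codes."""
    return [soundex(w) for w in a.split()] == [soundex(w) for w in b.split()]


print(soundex("Robert"), soundex("Rupert"))             # R163 R163
print(names_sound_alike("Umesh Gupta", "Umash Gupte"))  # True
```

Because Soundex ignores vowels after the first letter, it handles vowel typos well but will not catch a wrong first letter.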

What you want is a string distance. There are many flavors, but I would recommend starting with the Levenshtein distance.
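For reference, the Levenshtein distance counts the minimum number of single-character insertions, deletions and substitutions needed to turn one string into the other. A minimal pure-Python sketch, assuming plain case-sensitive comparison (the function name levenshtein is chosen here for illustration), could look like this:

```python
def levenshtein(a, b):
    """Return the minimum number of single-character edits turning a into b."""
    # previous[j] is the distance between the already processed prefix
    # of a and the first j characters of b; start with the empty prefix.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


print(levenshtein('Mohan Mehta', 'Mohan Mehte'))  # 1
print(levenshtein('Umesh Gupta', 'Umash Gupte'))  # 2
```

For names, a distance of 1 or 2 relative to the string length is a plausible starting point for "almost similar", but that cutoff is a guess to be validated on your data.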

You might want to look at NLTK (the Natural Language Toolkit), specifically the nltk.metrics package, which implements various string distance algorithms, including the Levenshtein distance mentioned already.

You could split the string and check to see if it contains at least one first/last name that is correct.
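A rough sketch of that idea, assuming names are separated by whitespace and compared case-insensitively (the helper name shares_a_name is hypothetical):

```python
def shares_a_name(a, b):
    """Return True if the two full names have at least one identical word."""
    words_a = set(a.casefold().split())
    words_b = set(b.casefold().split())
    return not words_a.isdisjoint(words_b)


print(shares_a_name('Mohan Mehta', 'Mohan Mehte'))  # True ('Mohan' matches)
print(shares_a_name('Umesh Gupta', 'Umash Gupte'))  # False (both words differ)
```

As the second call shows, this check fails when every part of the name is misspelled, and it also matches different people who merely share a first or last name, so it is probably best combined with one of the distance-based approaches.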

// calculate the similarity between 2 strings
public static double similarity(String s1, String s2) {
    String longer = s1, shorter = s2;
    if (s1.length() < s2.length()) { // longer should always have greater length
        longer = s2; shorter = s1;
    }
    int longerLength = longer.length();
    if (longerLength == 0) { return 1.0; /* both strings are zero length */ }
    /* // If you have StringUtils, you can use it to calculate the edit distance:
    return (longerLength - StringUtils.getLevenshteinDistance(longer, shorter)) /
                               (double) longerLength; */
    return (longerLength - editDistance(longer, shorter)) / (double) longerLength;
}
// Example implementation of the Levenshtein Edit Distance
// See http://rosettacode.org/wiki/Levenshtein_distance#Java
public static int editDistance(String s1, String s2) {
    s1 = s1.toLowerCase();
    s2 = s2.toLowerCase();
    int[] costs = new int[s2.length() + 1];
    for (int i = 0; i <= s1.length(); i++) {
        int lastValue = i;
        for (int j = 0; j <= s2.length(); j++) {
            if (i == 0)
                costs[j] = j;
            else {
                if (j > 0) {
                    int newValue = costs[j - 1];
                    if (s1.charAt(i - 1) != s2.charAt(j - 1))
                        newValue = Math.min(Math.min(newValue, lastValue),
                                costs[j]) + 1;
                    costs[j - 1] = lastValue;
                    lastValue = newValue;
                }
            }
        }
        if (i > 0)
            costs[s2.length()] = lastValue;
    }
    return costs[s2.length()];
}

Related

output whole list in C#

In Python I can easily do this, and the output is the whole list:
import random
list = [random.randrange(150) for i in range(10)]
print(list)
Can I do the same in C# without a for loop like the one below? Its output prints each of my list's elements on a separate line.
List<int> list = new List<int>();
Random rnd = new Random();
for (int i = 0; i < 10; i++) {
    list.Add(rnd.Next(150));
}
for (int i = 0; i < list.Count; i++) {
    Console.WriteLine(list[i]);
}

Well, we can do it in one line if you want as well. This code is also thread-safe but requires .NET 6.0 or higher due to the use of Random.Shared.
Console.WriteLine(string.Join(",", Enumerable.Range(0, 10).Select(_ => Random.Shared.Next(150))));
This generates an IEnumerable<int> with random integers from 0 to 149 and then writes them to the Console separated by commas.

As far as I know, there is no method in .NET that generates a list of random integers, but why not write your own? For example:
public static class MyEnumerable
{
    public static IEnumerable<int> RandomEnumerable(int maxValue, int count, Random random = default)
    {
        if (count < 0)
        {
            throw new ArgumentOutOfRangeException(nameof(count));
        }
        if (maxValue < 0)
        {
            throw new ArgumentOutOfRangeException(nameof(maxValue));
        }
        random ??= Random.Shared;
        for (int i = 0; i < count; i++)
        {
            yield return random.Next(maxValue);
        }
    }
}
Now you can do your task in two lines, like in Python:
var randomList = MyEnumerable.RandomEnumerable(150, 10).ToList();
Console.WriteLine($"[{string.Join(", ", randomList)}]");

Clipping a binary number to required length C/C++

I have written a short function to convert an input decimal number to a binary output. However, at a much higher level of the code, the end user should toggle an option as to whether or not they desire a 5B or 10B value. For the sake of some other low level maths, I have to clip the data here.
So I need some help figuring out how to clip the output to a desired length and stuff the required number of leading zeros.
The incomplete C code:
long dec2bin(int x_dec,int res)
{
    long x_bin = 0;
    int x_bin_len;
    int x_rem, i = 1;
    while (x_dec != 0)
    {
        x_rem = x_dec % 2;
        x_dec /= 2;
        x_bin += x_rem * i;
        i *= 10;
    }
    return x_bin;
}
I had completed a working proof of concept using Python. The end application, however, requires that I write this in C.
The working Python script:
def dec2bin(x_dec,x_res):
    x_bin = bin(x_dec)[2:] #Convert to binary (remove 0b prefix)
    x_len = len(x_bin)
    if x_len < x_res: #If smaller than desired resolution
        x_bin = '0' * (x_res-x_len) + x_bin #Stuff with leading 0s
    if x_len > x_res: #If larger than desired resolution
        x_bin = x_bin[x_len-x_res:x_len] #Display desired no. of LSBs
    return x_bin
I'm sure this has been done before. Indeed, my Python script suggests it should be relatively straightforward, but I'm not as experienced with C.
Any help is greatly appreciated.
Mark.

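As @yano suggested, I think you have to return an ASCII string to the caller, rather than a long. Below is the short function I wrote for my own purposes, for any base...
#include <string.h>  /* memset */

/* Convert i to an ndigits-wide, zero-padded string in the given base.
   Only the ndigits least-significant digits are kept. Assumes i >= 0,
   2 <= base <= 65 and ndigits < 999; returns a static buffer that the
   next call overwrites. */
char *itoa ( int i, int base, int ndigits ) {
  static char a[999], digits[99] = /* up to base 65 */
    "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz#$*";
  int n=ndigits;
  memset(a,'0',ndigits); a[ndigits]='\000';
  while ( --n >= 0) {
    a[n] = digits[i%base];
    if ( (i/=base) < 1 ) break; }
  return ( a );
} /* --- end-of-function itoa() --- */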
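As a cross-check for any C version, the clip-and-pad behaviour of the Python script in the question can also be stated with a bit mask. This is only a sketch under the assumption that x_dec is non-negative and x_res is at least 1; in C the same mask would presumably be written as x & ((1u << res) - 1) before converting the digits to characters.

```python
def dec2bin(x_dec, x_res):
    """Return the x_res least-significant bits of x_dec as a zero-padded string."""
    mask = (1 << x_res) - 1  # x_res ones, e.g. 0b11111 for x_res = 5
    # The mask drops the high bits; the format spec pads with leading zeros.
    return format(x_dec & mask, '0{}b'.format(x_res))


print(dec2bin(5, 5))      # 00101
print(dec2bin(3073, 10))  # 0000000001
```

Masking first means the padding step never sees a value wider than x_res bits, so clipping and zero-stuffing need no separate length checks.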

does C# have something equivalent to Pythons random.choices()

I'm trying to make choices based on their weight/probability.
This is what I had in Python:
import random
myChoiceList = ["Attack", "Heal", "Amplify", "Defense"]
myWeights = [70, 0, 15, 15]  # % probability, summing to 100%; e.g. Attack has a 70% chance of selection
print(random.choices(myChoiceList, weights=myWeights, k=1))
I want to do the same thing in C#. How does one do that?
Does C# have any method similar to random.choices()? All I know is Random.Next().
*This Python code works fine. random.choices takes (sequence, weights, k):
sequence: the values,
weights: a list where you can weigh the probability of each value,
k: the length of the returned list.
I'm looking to do the same in C#:
choose values based on their probability.

There is nothing built into C# like this, however, it's not that hard to add an extension method to recreate the same basic behavior:
static class RandomUtils
{
    public static string Choice(this Random rnd, IEnumerable<string> choices, IEnumerable<int> weights)
    {
        var cumulativeWeight = new List<int>();
        int last = 0;
        foreach (var cur in weights)
        {
            last += cur;
            cumulativeWeight.Add(last);
        }
        int choice = rnd.Next(last);
        int i = 0;
        foreach (var cur in choices)
        {
            if (choice < cumulativeWeight[i])
            {
                return cur;
            }
            i++;
        }
        return null;
    }
}
Then you can call it in a similar way as the Python version:
string[] choices = { "Attack", "Heal", "Amplify", "Defense" };
int[] weights = { 70, 0, 15, 15 };
Random rnd = new Random();
Console.WriteLine(rnd.Choice(choices, weights));
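For comparison, the same cumulative-weight idea can be written out in Python, which may help when checking a port against random.choices. This is an illustrative sketch, not how the standard library is necessarily implemented; the name weighted_choice is made up, and integer weights with a positive sum are assumed.

```python
import bisect
import itertools
import random


def weighted_choice(choices, weights, rng=random):
    """Pick one element of choices with probability proportional to its weight."""
    cumulative = list(itertools.accumulate(weights))  # e.g. [70, 70, 85, 100]
    total = cumulative[-1]
    if total <= 0:
        raise ValueError("weights must sum to a positive integer")
    r = rng.randrange(total)  # uniform integer in [0, total)
    # First index whose cumulative weight is strictly greater than r;
    # zero-weight entries can never be selected.
    return choices[bisect.bisect_right(cumulative, r)]


choices = ["Attack", "Heal", "Amplify", "Defense"]
weights = [70, 0, 15, 15]
print(weighted_choice(choices, weights))
```

The binary search via bisect only matters for long lists; for four items the linear scan in the C# version is just as good.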

You can draw a number with Random.Next(0, 100) and then choose the relevant item with a simple if/else chain or something similar. Your ranges will be [0-70), [70-85) and [85-100); "Heal" has weight 0, so it gets no range. Let me know if you need the full code.
Random ran = new Random();
int probability = ran.Next(0, 100); // 0..99 inclusive
string s;
if (probability < 70)
    s = "Attack";   // 0..69: 70%
else if (probability < 85)
    s = "Amplify";  // 70..84: 15%
else
    s = "Defense";  // 85..99: 15%

Optimize cython functions operating on python lists

I am currently migrating to Cython a set of functions that are currently implemented in C++ through scipy.weave (now deprecated).
These functions operate on time series that are 2D lists (e.g. [[17100, 19.2], [17101, 20.7], [17102, 20.3], ...]) both in input and in output. A sample function is subtract, which accepts two time series and calculates a new time series as the difference of the two inputs, going date by date.
The structure and the interfaces have to be maintained for backward compatibility, but my profiling trials show that the Cython port is about 30%-40% slower than the original scipy.weave implementation.
I have tried many ways to optimize (inner conversions to NumPy arrays and memoryviews, C pointers, ...), but the conversion time required lengthens the overall execution time. Even defining input and output as C++ vectors, leveraging Cython's implicit conversions, doesn't seem to be enough to match scipy.weave's speed. I have also used the various hints on boundscheck, wraparound, division, ...
The biggest slowdowns seem to be in functions that use nested loops, and I've seen that a small gain can be had by predefining the list size (cdef list target = [[-1, float('nan')]]*size).
I am aware that Cython can't perform that well on Python structures, especially lists, but are there any other tricks or techniques with which a speedup can be obtained?
=== EDIT - ADD CODE EXAMPLE ===
A good example of this type of function is the following.
The function takes a 2-D list of dates/prices and a 2-D list of dates/decimal factors and searches for matching dates between the two lists, calculating the output from the corresponding price/factor by multiplying or dividing (that is a third input parameter).
My best-performing Cython code:
cimport cython

@cython.cdivision(True)
@cython.boundscheck(False)
@cython.wraparound(False)
cpdef apply_conversion(list original_timeserie, list factor_timeserie, int divide_or_multiply=False):
    cdef:
        Py_ssize_t i, j = 0, size = len(original_timeserie), size2 = len(factor_timeserie)
        long original_date, factor_date
        double original_price, factor_price, conv_price
        list result = []
    for i in range(size):
        original_date = original_timeserie[i][0]
        for j in range(j, size2):
            factor_date = factor_timeserie[j][0]
            if original_date == factor_date:
                original_price = original_timeserie[i][1]
                factor_price = factor_timeserie[j][1]
                if divide_or_multiply:
                    if factor_price != 0:
                        conv_price = original_price / factor_price
                    else:
                        conv_price = float('inf')
                else:
                    conv_price = original_price * factor_price
                result.append([original_date, conv_price])
                break
    return result
The original scipy function:
int len = original_timeserie.length();
int len2 = factor_timeserie.length();
PyObject* py_serieconv = PyList_New(len);
PyObject* original_item = NULL;
PyObject* factor_item = NULL;
PyObject* date = NULL;
PyObject* value = NULL;
long original_date = 0;
long factor_date = 0;
double original_price = 0;
double factor_price = 0;
int j = 0;
for(int i=0;i<len;i++) {
    original_item = PyList_GetItem(original_timeserie, i);
    date = PyList_GetItem(original_item, 0);
    original_date = PyInt_AsLong(date);
    original_price = PyFloat_AsDouble( PyList_GetItem(original_item, 1) );
    factor_item = NULL;
    for(;j<len2;) {
        factor_item = PyList_GetItem(factor_timeserie, j++);
        factor_date = PyInt_AsLong(PyList_GetItem(factor_item, 0));
        if (factor_date == original_date) {
            factor_price = PyFloat_AsDouble(PyList_GetItem(factor_item, 1));
            value = PyFloat_FromDouble(original_price * (divide_or_multiply==0 ? factor_price : 1/factor_price));
            PyObject* py_new_item = PyList_New(2);
            Py_XINCREF(date);
            PyList_SetItem(py_new_item, 0, date);
            PyList_SetItem(py_new_item, 1, value);
            PyList_SetItem(py_serieconv, i, py_new_item);
            break;
        }
    }
}
return_val = py_serieconv;
Py_XDECREF(py_serieconv);

Weave Inline C++ Code in Python 2.7

I'm trying to rewrite this function:
def smoothen_fast(heightProfile, travelTime):
    smoothingInterval = 30 * travelTime
    heightProfile.extend([heightProfile[-1]]*smoothingInterval)
    # Get the mean of first `smoothingInterval` items
    first_mean = sum(heightProfile[:smoothingInterval]) / smoothingInterval
    newHeightProfile = [first_mean]
    for i in xrange(len(heightProfile)-smoothingInterval-1):
        prev = heightProfile[i] # the item to be subtracted from the sum
        new = heightProfile[i+smoothingInterval] # item to be added
        # Calculate the sum of previous items by multiplying
        # last mean with smoothingInterval
        prev_sum = newHeightProfile[-1] * smoothingInterval
        new_sum = prev_sum - prev + new
        mean = new_sum / smoothingInterval
        newHeightProfile.append(mean)
    return newHeightProfile
as embedded C++ Code:
import scipy.weave as weave
heightProfile = [0.14,0.148,1.423,4.5]
heightProfileSize = len(heightProfile)
travelTime = 3
code = r"""
    #include <string.h>
    int smoothingInterval = 30 * travelTime;
    double *heightProfileR = new double[heightProfileSize+smoothingInterval];
    for (int i = 0; i < heightProfileSize; i++)
    {
        heightProfileR[i] = heightProfile[i];
    }
    for (int i = 0; i < smoothingInterval; i++)
    {
        heightProfileR[heightProfileSize+i] = -1;
    }
    double mean = 0;
    for (int i = 0; i < smoothingInterval; i++)
    {
        mean += heightProfileR[i];
    }
    mean = mean/smoothingInterval;
    double *heightProfileNew = new double[heightProfileSize-smoothingInterval];
    for (int i = 0; i < heightProfileSize-smoothingInterval-1; i++)
    {
        double prev = heightProfileR[i];
        double newp = heightProfile[i+smoothingInterval];
        double prev_sum = heightProfileNew[i] * smoothingInterval;
        double new_sum = prev_sum - prev + newp;
        double meanp = new_sum / smoothingInterval;
        heightProfileNew[i+1] = meanp;
    }
    return_val = Py::new_reference_to(Py::Double(heightProfileNew));
"""
d = weave.inline(code,['heightProfile','heightProfileSize','travelTime'])
As the return value I need heightProfileNew, and I need to access it like a list in Python later.
I looked at these examples:
http://docs.scipy.org/doc/scipy/reference/tutorial/weave.html
The compiler keeps telling me that it doesn't know Py::, but in the examples there are no Py includes?

I know the question is old, but I think it is still interesting.
Assuming you're using weave to improve computation speed and that you know the length of your output beforehand, I suggest that you create the result before calling inline. That way you can create the result variable in Python (very easy). I would also suggest using an np.ndarray as the result because it makes sure you use the right datatype. You can iterate over ndarrays in Python the same way you iterate over lists.
import numpy as np
heightProfileArray = np.array(heightProfile)
# heightProfileArray = np.array(heightProfile, dtype=np.float32) if you want to make sure you have the right datatype. Another choice would be np.float64.
resultArray = np.zeros_like(heightProfileArray) # same array size and data type but filled with zeros
[..]
weave.inline(code,['heightProfile','heightProfileSize','travelTime','resultArray'])
for element in resultArray:
    print element
In your C++ code you can then just assign values to elements of that array:
[..]
resultArray[i+1] = 5.5;
[..]
