I have a bunch of people's names tied to their respective identifying numbers (e.g. Social Security Number/National ID/Passport Number). Due to duplication, though, one identity number can have up to 100 names, which could be similar or totally different. E.g. ID 221 could have the names Richard Parker, Mary Parker, Aunt May, Parker Richard, M#rrrrryy Richard, etc. Some are typos, but some are totally different names.
Initially, I want to display only 3 (or a similarly small number) of the names that are as different as possible from the rest, so as to alert the viewer that the multiple names might not be typos but could even be a case of identity theft, negligent data capture, or anything else!
I've read up on algorithms to detect similarity and am currently looking at one that computes a score: a score of 1 means the two strings are the same, while a lower score means they are dissimilar. In my use case, how can I go through, say, the 100 names and display the 3 that are most dissimilar? The algorithm for that just escapes me; I feel like I need a starting point, then compare it against all the others, and loop again, etc.
Take the function from https://stackoverflow.com/a/14631287/1082673 as you mentioned and iterate over all combinations in your list. This will work if you don't have too many entries; otherwise the computation time can increase pretty fast…
Here is how to generate the pairs for a given list:
import itertools

persons = ['person1', 'person2', 'person3']
for p1, p2 in itertools.combinations(persons, 2):
    print("Compare", p1, "and", p2)
I have two dataframes.
player_stats:
            player  minutes  total_points  assists
1   Erling Haaland       77            13        0
2  Kevin De Bruyne       90             6        1
and season_gw1:
                player  position  gw             team
10449    Erling Håland         4   1  Manchester City
10453  Kevin De Bruyne         3   1  Manchester City
I want to merge these two dataframes by player, but as you can see, for the first player (Haaland), the name is not spelled exactly the same in both dfs.
This is the code I'm using:
season_gw1_stats = season_gw1.merge(player_stats, on = 'player')
And the resulting df (season_gw1_stats) is this one:
                player  position  gw             team  minutes  total_points  assists
10453  Kevin De Bruyne         3   1  Manchester City       90             6        1
How do I merge dfs by similar values? (This is not the full dataframe - I also have some other names that are spelled differently in both dfs but are very similar).
In order to use standard pandas to "merge dataframes", you will pretty much need to eliminate "similar" from the problem statement. So we're faced with mapping to "matching" values that are identical across dataframes. Here's a pair of plausible approaches:

normalize each name in isolation
examine quadratic pairwise distances

1. normalize

Map variant spellings down to a smaller universe of spellings where collisions (matches) are more likely. There are many approaches:
case smash to lower
map accented vowels to [aeiou]
discard all vowels
use simplifying regexes like s/sch/sh/ and s/de /de/
use Soundex or later competitors like Metaphone
manually curate a restricted vocabulary of correct spellings
Cost is O(N), linear in the total length of the dataframes.
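As a sketch of a few of these normalizations using only the standard library (NFKD decomposition strips the accent, so 'Erling Håland' at least gets closer to colliding with 'Erling Haaland'):

import re
import unicodedata

def normalize(name: str) -> str:
    name = name.lower()                         # case smash
    name = unicodedata.normalize('NFKD', name)  # split accents off letters
    name = name.encode('ascii', 'ignore').decode('ascii')
    name = re.sub(r'\s+', ' ', name).strip()    # squeeze whitespace
    return name

print(normalize('Erling Håland'))   # 'erling haland'
print(normalize('Erling Haaland'))  # 'erling haaland'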
2. pairwise distances

We wish to canonicalize: to boil down multiple variant spellings to a single distinguished canonical spelling. Begin by optionally normalizing, then sort at a cost of O(N log N), and finally make a linear pass that outputs only unique names. This trivial pre-processing step reduces N, which helps a lot when dealing with the O(N^2) quadratic cost.
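In Python that pre-processing step is a one-liner (all_names standing in for your raw list of names):

unique_names = sorted(set(all_names))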
Define a distance metric which accepts two names. When given a pair of identical names it must report a distance of zero; otherwise it deterministically reports a positive real number. You might use Levenshtein, or MRA.

Use nested loops to compare all names against all names. If the distance between two names is less than a threshold, arbitrarily declare name1 the winner, overwriting the 2nd name with that 1st value. The effect is to cluster multiple variant spellings down to a single winning spelling. Cost is O(N^2) quadratic.
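Here is a sketch of that quadratic pass. It uses difflib's ratio() from the standard library as a stand-in metric (distance = 1 - ratio); a true Levenshtein distance from a package such as python-Levenshtein would slot in the same way, and THRESHOLD is something you'd tune by hand:

from difflib import SequenceMatcher

THRESHOLD = 0.25

def distance(a: str, b: str) -> float:
    # 0.0 for identical names, a positive number otherwise
    return 1.0 - SequenceMatcher(None, a, b).ratio()

def cluster(names):
    canonical = list(names)
    for i in range(len(canonical)):
        for j in range(i + 1, len(canonical)):
            if distance(canonical[i], canonical[j]) < THRESHOLD:
                # arbitrarily declare name i the winner
                canonical[j] = canonical[i]
    return canonical

print(cluster(['erling haland', 'erling haaland', 'kevin de bruyne']))
# ['erling haland', 'erling haland', 'kevin de bruyne']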
Perhaps you're willing to tinker with the distance function a bit. You might give the initial letter(s) a heavier weight, such that a mismatched prefix guarantees the distance will exceed the threshold. In that case sorted names will help out, and the nested loop can be confined to just a small window of similarly prefixed names, with early termination once it sees the prefix has changed. Noting the distance between adjacent sorted names can help with manually choosing a sensible threshold parameter.
Finally, with adjusted names in hand, you're in a position to .merge() using exact equality tests.
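Putting it together for this question, one hedged sketch — normalize() and cluster() are the sketches above, and clustering runs over the union of names from both frames so that variants in either frame collapse to the same winner:

all_keys = sorted({normalize(p) for p in player_stats['player']}
                  | {normalize(p) for p in season_gw1['player']})
winner = dict(zip(all_keys, cluster(all_keys)))  # variant -> canonical spelling

player_stats['key'] = [winner[normalize(p)] for p in player_stats['player']]
season_gw1['key'] = [winner[normalize(p)] for p in season_gw1['player']]
season_gw1_stats = season_gw1.merge(player_stats, on='key')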
I have a sort of more general question on the process of working with text data.
My goal is to create UNIQUE short labels/descriptions for products from existing long descriptions, based on specific rules.
In practice it looks like this: I get the data you see in the Existing_Long_Description column and, based on rules and loops in Python, I change it to the data in the New_Label column.
Existing_Long_Description                                                         New_Label
Edge protector clamping range 1-2 mm length 10 m width 6.5 mm height 9.5 mm blac  Edge protector BLACK RNG 1-2MM L=10M
Edge protector clamping range 1-2 mm length 10 m width 6.5 mm height 9.5 mm red   Edge protector RED RNG 1-2MM L=10M
This shortening to the desired format is not a problem. The problem starts when checking the uniqueness of the New_Label column. Due to this shortening I might create duplicates:
Existing_Long_Description                                                    New_Label
Draw-in collet chuck ER DIN 69871AD clamping dm 1-10MM SK40 projecting L=1  Draw-in collet chuck dm 1-10MM
Draw-in collet chuck ER DIN 69871AD clamping dm 1-10MM SK40 projecting L=6  Draw-in collet chuck dm 1-10MM
Draw-in collet chuck ER DIN 69871AD clamping dm 1-10MM SK40 projecting L=8  Draw-in collet chuck dm 1-10MM
To solve this I need to add some distinguishing factor to my New_Label column, based on the difference in Existing_Long_Description. The problem is that the conflict might be between an unknown number of articles.
I thought about the following process:
Identify the duplicates in Existing_Long_Description: if there are duplicates there, I know those can't be resolved in New_Label.
Identify the duplicates in the New_Label column; if they are not in the selection above, I know these can be resolved.
For those that can be resolved, run some distinguisher to find where they differ, and extract that difference into another column, to decide later what to use in New_Label.
Does what I want to do make sense? As I am doing this for the first time, I am wondering: is there a way of working that you would recommend?
I read some articles like this one: Find the similarity metric between two strings. Elsewhere on Stack Overflow I read about https://docs.python.org/3/library/difflib.html, which I am planning to use, but it still feels rather inefficient to me, and maybe someone here can help.
Thanks!
A relational database would be a good fit for this problem, with appropriate UNIQUE indexes configured. But let's assume you're going to solve it in memory, rather than on disk.
Assume that get_longs() will read long descriptions from your data source.
dup long descriptions
Avoid processing like this:
longs = []
for long in get_longs():
    if long not in longs:
        longs.append(long)
Why?
It is quadratic, running in O(N^2) time, for N descriptions.
Each 'in' membership test takes linear O(N) time, and we perform N such operations on the list.
To process 1000 parts would regrettably require a million operations.
Instead, take care to use an appropriate data structure, a set:
longs = set(get_longs())
That's enough to quickly de-dup the long descriptions, in linear time.
dup short descriptions
Now the fun begins.
You explained that you already have a function that works like a champ.
But we must adjust its output in the case of collisions.
class Dedup:
    def __init__(self):
        self.short_to_long = {}

    def get_shorts(self):
        """Produces unique short descriptions."""
        for long in sorted(set(get_longs())):
            short = summary(long)
            orig_long = self.short_to_long.get(short)
            if orig_long:
                short = self.deconflict(short, orig_long, long)
            self.short_to_long[short] = long
            yield short

    def deconflict(self, short, orig_long, long):
        """Produces a novel short description that won't conflict with existing ones."""
        for word in sorted(set(long.split()) - set(orig_long.split())):
            short += f' {word}'
            if short not in self.short_to_long:  # Yay, we win!
                return short
        # Boo, we lose.
        raise ValueError(f"Sorry, can't find a good description: {short}\n{orig_long}\n{long}")
The expression that subtracts one set from another answers the question, "What words in long would help me to uniqueify this result?" Now of course, some of them may have already been used by other short descriptions, so we take care to check for that.
Given several long descriptions that collide in the way you're concerned about, the 1st one will have the shortest description, and ones appearing later will tend to have longer "short" descriptions.

The approach above is a bit simplistic, but it should get you started. It does not, for example, distinguish between "claw hammer" and "hammer claw": both strings survive initial uniqueification, but then there are no more words to help with deconflicting. For your use case the approach above is likely to be "good enough".
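A quick demonstration under the same assumptions (get_longs() reads your data source; summary() stands in for your existing rule-based shortener):

def get_longs():
    return [
        'Draw-in collet chuck ER DIN 69871AD clamping dm 1-10MM SK40 projecting L=1',
        'Draw-in collet chuck ER DIN 69871AD clamping dm 1-10MM SK40 projecting L=6',
    ]

def summary(long):
    return 'Draw-in collet chuck dm 1-10MM'  # stand-in for the rule-based shortener

print(list(Dedup().get_shorts()))
# ['Draw-in collet chuck dm 1-10MM', 'Draw-in collet chuck dm 1-10MM L=6']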
I know Python isn't the best idea for writing any kind of software of this nature. My reasoning is to use this type of algorithm in a Raspberry Pi 3's decision-making (still unsure how that will go), and the libraries and APIs that I'll be using (Adafruit motor HATs, Google services, OpenCV, various sensors, etc.) all play nicely with Python imports; not to mention I'm just more comfortable in this environment for the rPi specifically. I've already cursed it, as object-oriented languages such as Java or C++ just make more sense to me, but I'd rather deal with Python's inefficiencies and focus on the bigger picture of integration for the rPi.
I won't explain the code here, as it's pretty well documented in the comment sections throughout the script. My question is as stated above: can this be considered a basic genetic algorithm? If not, what must it have to be a basic AI or genetic code? Am I on the right track for this type of problem solving? I know there are usually weighted variables and functions to promote "survival of the fittest", but that can be popped in as needed, I think.
I've read quite a few forums and articles about this topic. I didn't want to copy someone else's code that I barely understand and use it as a base for a larger project of mine; I want to know exactly how it works so I'm not confused when something isn't working along the way. So I just tried to comprehend the basic idea of how it works, and wrote it how I interpreted it. Please remember I'd like to stay in Python for this. I know rPis have multiple environments for C++, Java, etc., but as stated before, most hardware components I'm using have only Python APIs. If I'm wrong, explain at the algorithmic level, not just with a block of code (again, I really want to understand the process). Also, please don't nitpick code conventions unless it's pertinent to my problem; everyone has a style and this is just a sketch for now. Here it is, and thanks for reading!
# Created by X3r0, 7/3/2016
# Basic genetic algorithm utilizing a two dimensional array system.
# the 'DNA' is the larger array, and the 'gene' is a smaller array as an element
# of the DNA. There exists no weighted algorithms, or statistical tracking to
# make the program more efficient yet; it is straightforwardly random and solves
# its problem randomly. At this stage, only the base element is iterated over.
# Basic Idea:
# 1) User inputs constraints onto array
# 2) Gene population is created at random given user constraints
# 3) DNA is created with randomized genes ( will never randomize after )
# a) Target DNA is created with loop control variable as data (basically just for some target structure)
# 4) CheckDNA() starts with base gene from DNA, and will recurse until gene matches the target gene
# a) Randomly select two genes from DNA
# b) Create a candidate gene by splicing both parent genes together
# c) Check candidate gene against the target gene
# d) If there exists a match in gene elements, a child gene is created and inserted into DNA
# e) If the child gene in DNA is not equal to target gene, recurse until it is
import random

DNAsize = 32
geneSize = 5
geneDiversity = 9
geneSplit = 4
numRecursions = 0
DNA = []
targetDNA = []

def init():
    global DNAsize, geneSize, geneDiversity, geneSplit, DNA
    print("This is a very basic form of genetic software. Input variable constraints below. "
          "Good starting points are: DNA strand size (array size): 32, gene size (sub array size): 5, "
          "gene diversity (randomized 0 - x): 5, gene split (where to split gene array for splicing): 2")
    DNAsize = int(input('Enter DNA strand size: '))
    geneSize = int(input('Enter gene size: '))
    geneDiversity = int(input('Enter gene diversity: '))
    geneSplit = int(input('Enter gene split: '))
    # initializes the gene population, and kicks off
    # checkDNA recursion
    initPop()
    checkDNA(DNA[0])

def initPop():
    # builds an array of smaller arrays
    # given DNAsize
    for x in range(DNAsize):
        buildDNA()
        # builds the goal array with a recurring
        # numerical pattern, in this case just the loop
        # control variable
        buildTargetDNA(x)

def buildDNA():
    newGene = []
    # builds a smaller array (gene) using a given geneSize
    # and randomized with values 0 - [given geneDiversity]
    for x in range(geneSize):
        newGene.append(random.randint(0, geneDiversity))
    # append the built array to the larger array
    DNA.append(newGene)

def buildTargetDNA(x):
    # builds the target array, iterating with x as a loop
    # control from the call in init()
    newGene = []
    for y in range(geneSize):
        newGene.append(x)
    targetDNA.append(newGene)

def checkDNA(childGene):
    global numRecursions
    numRecursions = numRecursions + 1
    gene = DNA[0]
    targetGene = targetDNA[0]
    parentGeneA = DNA[random.randint(0, DNAsize - 1)]  # randomly selects an array (gene) from larger array (DNA)
    parentGeneB = DNA[random.randint(0, DNAsize - 1)]
    pos = random.randint(geneSplit - 1, geneSplit + 1)  # randomly selects a position to split gene for splicing
    candidateGene = parentGeneA[:pos] + parentGeneB[pos:]  # spliced gene given split from parentA and parentB
    print("DNA Splice Position: " + str(pos))
    print("Element A: " + str(parentGeneA))
    print("Element B: " + str(parentGeneB))
    print("Candidate Element: " + str(candidateGene))
    print("Target DNA: " + str(targetDNA))
    print("Old DNA: " + str(DNA))
    # iterates over the candidate gene and compares each element to the target gene
    # if the candidate gene element hits a target gene element, the resulting child
    # gene is created
    for x in range(geneSize):
        #if candidateGene[x] != targetGene[x]:
        #    print("false ")
        if candidateGene[x] == targetGene[x]:
            #print("true ")
            childGene.pop(x)
            childGene.insert(x, candidateGene[x])
    # if the child gene isn't quite equal to the target, and recursion hasn't reached
    # a max (apparently 900), the child gene is inserted into the DNA. Recursion occurs
    # until the child gene equals the target gene, or max recursion depth is exceeded
    if childGene != targetGene and numRecursions < 900:
        DNA.pop(0)
        DNA.insert(0, childGene)
        print("New DNA: " + str(DNA))
        print(numRecursions)
        checkDNA(childGene)

init()
print("Final DNA: " + str(DNA))
print("Number of generations (recursions): " + str(numRecursions))
I'm working with evolutionary computation right now, so I hope my answer will be helpful to you. Personally, I work with Java, mostly because it's one of my main languages and for its portability; I've tested on Linux, Windows, and Mac. In my case I work with permutation encoding, but if you are still learning how a GA works, I strongly recommend binary encoding. Let me describe my program's workflow:
1-. Set my main variables
These are PopulationSize, IndividualSize, MutationRate, and CrossoverRate. You also need to create an objective function and decide on the selection and crossover methods. For this example, let's say my PopulationSize is 50, IndividualSize is 4, MutationRate is 0.04 (4%), CrossoverRate is 0.9 (90%), the selection method is roulette wheel, and the crossover method is single point.
My objective function only checks whether my individuals are capable of representing the number 15 in binary, so the best individual must be 1111.
2-. Initialize my Population
For this I create 50 individuals (50 is given by my PopulationSize) with random genes.
3-. Loop starts
For each individual in the population you need to:
Evaluate fitness according to the objective function. If an individual is represented by the characters 0010, that means its fitness is 1. As you can see, this is a simple fitness function; you can create your own while you are learning (like fitness = 1/numberOfOnes). You also need to assign the sum of all the fitnesses to a variable called populationFitness; this will be useful in the next step.
Select the best individuals. There are many methods you can use for this task, but we will use the roulette wheel method, as mentioned before. You assign a value to every individual in your population, given by the formula (fitness/populationFitness) * 100. So if your population fitness is 10 and a certain individual's fitness is 3, that individual has a 30% chance of being selected for crossover with another individual. If another individual has a fitness of 4, its value will be 40%.
Apply crossover. Once you have the "best" individuals of your population, you need to create a new population, formed from individuals of the previous population. For each individual, draw a random number between 0 and 1. If this number is below 0.9 (since our crossoverRate = 90%), the individual can reproduce, so you select another individual. Each new individual has these 2 parents and inherits their genes. For example:
Let's say parentA = 1001 and parentB = 0111. We need to create a new individual from their genes. There are many ways to do this: uniform crossover, single-point crossover, two-point crossover, etc. We will use single-point crossover. In this method we choose a random point between the first gene and the last gene, then create the new individual from the first genes of parentA and the last genes of parentB. In a visual form:
parentA = 1001
parentB = 0111
crossoverPoint = 2
newIndividual = 1011
As you can see, the new individual shares its parents' genes.
Once you have a new population of new individuals, you apply mutation. For each individual in the new population, generate a random number between 0 and 1. If this number is below 0.04 (since our mutationRate = 0.04), apply a mutation to a random gene. In binary encoding, mutation just flips a 1 to a 0 or vice versa. In a visual form:
individual = 1011
randomPoint = 3
mutatedIndividual = 1010
Get the best individual
If this individual has reached the solution, stop. Otherwise, repeat the loop.
End
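As a minimal sketch of these steps in Python (the question's language): binary encoding, roulette-wheel selection, single-point crossover, and bit-flip mutation, with the example parameters above. This is illustrative only, not a full implementation:

import random

POP_SIZE, IND_SIZE = 50, 4
CROSSOVER_RATE, MUTATION_RATE = 0.9, 0.04
TARGET = [1, 1, 1, 1]  # the number 15 in binary

def fitness(ind):
    # count of bits matching the target; 4 means solved
    return sum(g == t for g, t in zip(ind, TARGET))

def roulette(pop, fits):
    # pick an individual with probability proportional to its fitness
    total = sum(fits) or 1
    r = random.uniform(0, total)
    acc = 0.0
    for ind, f in zip(pop, fits):
        acc += f
        if acc >= r:
            return ind
    return pop[-1]

def evolve(max_generations=1000):
    pop = [[random.randint(0, 1) for _ in range(IND_SIZE)]
           for _ in range(POP_SIZE)]
    for gen in range(max_generations):
        fits = [fitness(ind) for ind in pop]
        best = max(pop, key=fitness)
        if best == TARGET:            # solution reached: stop
            return gen, best
        new_pop = []
        while len(new_pop) < POP_SIZE:
            a, b = roulette(pop, fits), roulette(pop, fits)
            if random.random() < CROSSOVER_RATE:
                point = random.randint(1, IND_SIZE - 1)  # single-point crossover
                child = a[:point] + b[point:]
            else:
                child = a[:]
            if random.random() < MUTATION_RATE:
                i = random.randrange(IND_SIZE)           # bit-flip mutation
                child[i] = 1 - child[i]
            new_pop.append(child)
        pop = new_pop                 # otherwise, repeat the loop
    return max_generations, best

print(evolve())  # e.g. (2, [1, 1, 1, 1])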
I hope you understand the basic idea of a genetic algorithm. If you are truly interested in learning more, you can check the following links:
http://www.obitko.com/tutorials/genetic-algorithms/
This link explains the basics of a genetic algorithm in a clearer way.
http://natureofcode.com/book/chapter-9-the-evolution-of-code/
This book also explains what a GA is, and provides some code in Processing (essentially Java), which I think you can follow.
Also I would recommend the following books:
An Introduction to Genetic Algorithms - Melanie Mitchell
Evolutionary algorithms in theory and practice - Thomas Bäck
Introduction to genetic algorithms - S. N. Sivanandam
If you have no money, you can easily find all these books as PDFs.
Also, you can always search for articles on scholar.google.com.
Almost all are free to download.
Just to add a bit to Alberto's great answer, you need to watch out for two issues as your solution evolves.
The first one is over-fitting. This basically means that your solution is complex enough to "learn" all the samples, but it is not applicable outside the training set. To avoid this, you need to make sure that the "amount" of information in your training set is much larger than the amount of information that can fit in your solution.
The second problem is plateaus. There are cases where you arrive at mediocre solutions that are nonetheless good enough to "outcompete" any emerging solution, so your progress stalls (one way to see this: your fitness gets "stuck" at a certain, less-than-optimal number). One method for dealing with this is extinctions: track the rate of improvement of your optimal solution, and if the improvement has been 0 for the last N generations, just nuke your population. That is, delete the population and the list of optimal individuals and start over. Randomness will make it so that the solutions eventually surpass the plateau.
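As a sketch of that extinction idea in Python (to match the question's language; step, random_population, best_fitness, and goal are placeholders for your own GA pieces):

def evolve_with_extinctions(step, random_population, best_fitness, goal,
                            stagnation_limit=50):
    """Restart ("nuke") the population whenever progress stalls."""
    population = random_population()
    best_seen, stagnant = float('-inf'), 0
    while True:
        population = step(population)  # one generation: select, crossover, mutate
        current = best_fitness(population)
        if current >= goal:
            return population          # solved
        if current > best_seen:
            best_seen, stagnant = current, 0
        else:
            stagnant += 1
        if stagnant >= stagnation_limit:
            # plateau detected: extinction, then start over
            population = random_population()
            best_seen, stagnant = float('-inf'), 0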
Another thing to keep in mind is that the default Random class is really bad at randomness. I have had solutions improve dramatically simply by using something like the Mersenne Twister random generator or a hardware entropy generator.
I hope this helps. Good luck.
I'm looking for a method to create all possible unique starting positions for an N-player (N a power of 2) single-elimination (knockout) bracket tournament.
Let's say we have players 'A', 'B', 'C', and 'D' and want to find all possible initial positions. The tournament would then look like this:
A vs B, C vs D. Then winner(AB) vs winner(CD).
(I will use the notation (A,B,C,D) for the setup above)
Those would simply be all possible permutations of 4 elements, there are 4!=24 of those, and it's easy to generate them.
But they wouldn't be unique for the Tournament, since
(A,B,C,D), (B,A,C,D), (B,A,D,C), (C,D,A,B), ...
would all lead to the same matches being played.
In this case, the set of unique setups is, I think:
(A,B,C,D), (A,C,B,D), (A,D,C,B)
All other combinations would be "symmetric".
Now my questions would be for the general case of N=2^d players:
how many such unique setups are there?
is this a known problem I could look up? Haven't found it yet.
is there a method to generate them all?
how would this method look in python
(questions ranked by perceived usefulness)
I have stumbled upon this entry, but it does not really deal with the problem I'm discussing here.
how many such unique setups are there?
Let there be n teams. There are n! ways to list them in order. We'll start with that, then deal with the over-counting.
Say we have 8 teams. One possibility is
ABCDEFGH
Swapping teams 1 and 2 won't make a difference. We can have
BACDEFGH
and the same teams play. Divide by 2 to account for that. Swapping 3 and 4 won't make a difference either; divide by 2 again. The same goes for 5 and 6, and for 7 and 8. In total there are 4 such pairs (4 matches in the first round), so we take n! and divide by 2^(n/2).
But here is the thing. We can have the order
CDABEFGH
In this example, we swap the first two teams with the third and fourth. CDABEFGH is indistinguishable from ABCDEFGH for this purpose. There are n/4 such pair-of-pairs swaps, so we divide by 2^(n/4).
The same thing happens again at every level of the bracket, so altogether we divide by 2^(n/2 + n/4 + ... + 1) = 2^(n-1). The total number of starting positions is therefore n!/2^(n-1). For n = 4 that gives 4!/2^3 = 3, matching the three setups listed in the question.
We can also think of it a bit differently. If we look at https://stackoverflow.com/posts/2269581/revisions, we can think of it as a tree.
a               b (runner up)
a       e
a   c   e   h
a b c d e f h g
Here there are 8! ways to arrange all the letters at the base, each determining one way for the bracket to work out. If we are looking only at the starting position, it doesn't matter who won. There were a total of 7 games (and each game could have turned out differently), so we divide by 2^7 to account for that over-counting: 8!/2^7 = 315, consistent with n!/2^(n-1).
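For "is there a method to generate them all / how would this look in Python", one hedged sketch: canonicalize each permutation by recursively ordering the two halves of the bracket, then deduplicate. This is brute force over all n! permutations, fine for n = 4 or 8 but hopeless much beyond that:

from itertools import permutations

def canonical(bracket):
    # Recursively put the two half-brackets into a fixed order, so all
    # setups that play the same matches map to the same tuple.
    if len(bracket) == 1:
        return tuple(bracket)
    half = len(bracket) // 2
    left = canonical(bracket[:half])
    right = canonical(bracket[half:])
    return left + right if left < right else right + left

def unique_setups(players):
    return {canonical(list(p)) for p in permutations(players)}

setups = unique_setups('ABCD')
print(len(setups))     # 3, i.e. 4!/2^3
print(sorted(setups))  # [('A','B','C','D'), ('A','C','B','D'), ('A','D','B','C')]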
I am trying to teach myself Dynamic Programming, and ran into this problem from MIT.
We are given a checkerboard which has 4 rows and n columns, and
has an integer written in each square. We are also given a set of 2n pebbles, and we want to
place some or all of these on the checkerboard (each pebble can be placed on exactly one square)
so as to maximize the sum of the integers in the squares that are covered by pebbles. There is
one constraint: for a placement of pebbles to be legal, no two of them can be on horizontally or
vertically adjacent squares (diagonal adjacency is ok).
(a) Determine the number of legal patterns that can occur in any column (in isolation, ignoring
the pebbles in adjacent columns) and describe these patterns.
Call two patterns compatible if they can be placed on adjacent columns to form a legal placement.
Let us consider subproblems consisting of the first k columns, where 1 ≤ k ≤ n. Each subproblem can
be assigned a type, which is the pattern occurring in the last column.
(b) Using the notions of compatibility and type, give an O(n)-time dynamic programming algorithm for computing an optimal placement.
Ok, so for part a: There are 8 possible solutions.
For part b, I'm unsure, but this is where I'm headed:
Split into sub-problems. Assume 1 ≤ i ≤ n.
1. Define Cj[i] to be the optimal value from pebbling columns 0,...,i, such that column i has pattern type j.
2. Create 8 separate arrays of n elements, one for each pattern type.
I am not sure where to go from here. I realize there are solutions to this problem online, but the solutions don't seem very clear to me.
You're on the right track. As you examine each new column, you will end up computing all possible best scores up to that point.

Let's say you have built your compatibility list (a 2D array) and called it L, such that for each pattern i there are one or more compatible patterns Li[y].

Now you examine column j. First, compute that column's isolated score for each pattern i; call it Sj[i]. For each pattern i and each compatible pattern x = Li[y], you want to maximize the total score Cj, so that Cj[x] = Cj-1[i] + Sj[x]. This is a simple array test-and-update (if bigger).

In addition, store the pebbling pattern that led to each score. When you update Cj[x] (i.e. you increase its score from its present value), remember the preceding pattern that caused the update as Pj[x] = i. That says "pattern x gave the best result, given the preceding pattern i".

When you are all done, just find the pattern i with the best score Cn[i]. You can then backtrack using Pj to recover the pebbling pattern in each column that led to this result.
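As a sketch of this DP in Python: column patterns are bitmasks over the 4 rows, the compatibility table and per-column scores Sj are built as described, and board is assumed to be a list of n columns of 4 integers. The backtracking table Pj is omitted for brevity, so this returns just the optimal value:

def max_pebbling(board):
    n = len(board)
    # the 8 legal single-column patterns: no two vertically adjacent bits set
    patterns = [m for m in range(16) if (m & (m << 1)) == 0]
    # two patterns are compatible if they share no row (no horizontal adjacency)
    compat = {p: [q for q in patterns if (p & q) == 0] for p in patterns}

    def score(col, pat):
        return sum(col[r] for r in range(4) if (pat >> r) & 1)

    # best[p] = optimal value over the columns seen so far, ending in pattern p
    best = {p: score(board[0], p) for p in patterns}
    for j in range(1, n):
        best = {p: score(board[j], p) + max(best[q] for q in compat[p])
                for p in patterns}
    return max(best.values())

print(max_pebbling([[1, 2, 3, 4], [5, 6, 7, 8]]))  # 18: pebble rows 2,4 then 1,3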