Given a set of strings (first column) along with counts (second column), e.g.:
aaaa 10
aaab 5
abbb 3
cbbb 2
dbbb 1
cccc 8
Are there any algorithms or even implementations (ideally as a Unix executable, R, or Python) which collapse this set into a new set based on a given Hamming distance?
Collapsing implies adding the counts.
Strings with a lower count are collapsed into strings with higher counts.
For example, for Hamming distance 1, the above set would collapse the second string aaab into aaaa, since they are one Hamming distance apart and aaaa has the higher count.
The collapsed entry would have the combined count, here aaaa 15.
For this set, we would therefore get the following collapsed set:
aaaa 15
abbb 6
cccc 8
Ideally, an implementation should be efficient, so even heuristics that do not guarantee an optimal solution would be appreciated.
Further background and motivation
Calculating the Hamming distance between two strings (a pair) has been implemented in most programming languages. A brute-force solution would compute the distance between all pairs. Maybe there is no way around it. However, I'd imagine an efficient solution would avoid calculating the distance for all pairs. There may be clever ways to save some calculations based on metric theory (since Hamming distance is a metric), e.g. if the Hamming distance between x and z is 3, and between x and y is 3, I could avoid calculating the distance between y and z. Maybe there is a clever k-mer approach, or maybe an efficient solution for a constant distance (say d=1).
Even if there were only a brute-force solution, I'd be curious whether this has been implemented before and how to use it (ideally without having to implement it myself).
I thought up the following:
This reports the item with the highest score, together with the sum of its score and the scores of its nearby neighbors. Once a neighbor is used, it is not reported separately.
I suggest using a Vantage-point tree as the metric index.
The algorithm would look like this:
1. construct the metric index from the strings and their scores
2. construct the max heap from the strings and their scores
3. for the string with the highest score in the max heap:
4. use the metric index to find the nearby strings
5. print the string, and the sum of its score and its nearby strings' scores
6. remove from the metric index the string and each of the nearby strings
7. remove from the max heap the string and each of the nearby strings
8. repeat steps 3-7 until the max heap is empty
Perhaps this could be simplified by using a "used" table rather than removing anything. The metric-space index would then not need efficient deletion, nor would the max heap need to support deletion by value. But this would be slower if the neighborhoods are large and overlap frequently, so efficient deletion might be a necessary difficulty.
1. construct the metric index from the strings and their scores
2. construct the max heap from the strings and their scores
3. construct the used table as an empty set
4. for the string with the highest score in the max heap:
5. if this string is in the used table: start over with the next string
6. use the metric index to find the nearby strings
7. remove any nearby strings that are in the used table
8. print the string, and the sum of its score and its nearby strings' scores
9. add the string and its nearby strings to the used table
10. repeat steps 4-9 until the max heap is empty
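For concreteness, here is a minimal Python sketch of the used-table variant (my own illustration, not a reference implementation). A brute-force neighbor scan stands in for the vantage-point-tree index, so it is quadratic, but it shows the bookkeeping:

import heapq

def hamming(a, b):
    # assumes equal-length strings
    return sum(x != y for x, y in zip(a, b))

def collapse(counts, d=1):
    # counts: dict mapping string -> count
    heap = [(-c, s) for s, c in counts.items()]   # max heap via negated counts
    heapq.heapify(heap)
    used = set()
    result = {}
    while heap:
        _, s = heapq.heappop(heap)
        if s in used:
            continue
        # a metric index (e.g. a vantage-point tree) would find these more cheaply
        neighbors = [t for t in counts
                     if t != s and t not in used and hamming(s, t) <= d]
        result[s] = counts[s] + sum(counts[t] for t in neighbors)
        used.add(s)
        used.update(neighbors)
    return result

On the example above, collapse({"aaaa": 10, "aaab": 5, "abbb": 3, "cbbb": 2, "dbbb": 1, "cccc": 8}) returns {"aaaa": 15, "cccc": 8, "abbb": 6}.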
I cannot provide a complexity analysis.
I was thinking about the second algorithm. The part I thought was slow was checking the neighborhood against the used table. This is not needed, as deletion from a vantage-point tree can be done in linear time: when searching for the neighbors, remember where they were found and remove them later using those locations. If a neighbor is used as a vantage point, mark it as removed so that a search will not return it, but otherwise leave it alone. I think this restores the algorithm to below quadratic; otherwise it would be something like the number of items times the size of a neighborhood.
In response to the comment: the problem statement was "Strings with a lower count are collapsed into strings with higher counts," and this does compute that. It is not a greedy approximation that could produce a non-optimal result, since there was nothing to maximize or minimize. It is an exact algorithm: it returns the item with the highest score, combined with the scores of its neighborhood.
This can be viewed as assigning a leader to each neighborhood, such that each item has at most one leader and that leader has the largest overall score so far; it can also be viewed as a directed graph.
The specification wasn't for a dynamic programming or optimization problem. For that, you would ask for the item with the highest score in the highest-total-scoring neighborhood. That can also be solved in a similar way, by changing the ranking function for a string from its score alone to the pair (sum of its score and its neighborhood's scores, its score).
It does mean that it can't be solved with a max heap over the scores, since removing items affects the neighbors of the neighborhood, and one would have to recalculate their neighborhood scores before again finding the item with the highest-total-scoring neighborhood.
Related
I have two dataframes.
player_stats:
player minutes total_points assists
1 Erling Haaland 77 13 0
2 Kevin De Bruyne 90 6 1
and, season_gw1:
player position gw team
10449 Erling Håland 4 1 Manchester City
10453 Kevin De Bruyne 3 1 Manchester City
I want to merge these two dataframes by player, but as you can see, for the first player (Haaland), the name is not spelled exactly the same in both dfs.
This is the code I'm using:
season_gw1_stats = season_gw1.merge(player_stats, on = 'player')
And the resulting df (season_gw1_stats) is this one:
player position gw team minutes total_points assists
10453 Kevin de Bruyne 3 1 Manchester City 90 6 1
How do I merge dfs by similar values? (This is not the full dataframe; I also have some other names that are spelled differently in both dfs but are very similar.)
In order to use standard pandas to "merge dataframes" you will pretty much need to eliminate "similar" from the problem statement. So we're faced with mapping to "matching" values that are identical across dataframes. Here's a pair of plausible approaches:
1. normalize each name in isolation
2. examine quadratic pairwise distances
1. normalize
Map variant spellings down to a smaller universe of spellings where collisions (matches) are more likely. There are many approaches:
case smash to lower
map accented vowels to [aeiou]
discard all vowels
use simplifying regexes like s/sch/sh/ and s/de /de/
use Soundex or later competitors like Metaphone
manually curate a restricted vocabulary of correct spellings
Cost is O(N), linear in the total length of the dataframes.
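As an illustration, a minimal normalization pass along these lines, using only the standard library (the exact rules are up to you and would need tuning for your data), could be:

import re
import unicodedata

def normalize(name):
    name = name.lower()                       # case smash to lower
    # strip accents: decompose, then drop the combining marks
    name = unicodedata.normalize("NFKD", name)
    name = "".join(ch for ch in name if not unicodedata.combining(ch))
    name = re.sub(r"sch", "sh", name)         # example simplifying regexes
    name = re.sub(r"de ", "de", name)
    return name

Note that this alone turns "Erling Håland" into "erling haland" but leaves "erling haaland" distinct, so the pairwise-distance pass below is still needed to merge doubled-vowel variants.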
2. pairwise distances
We wish to canonicalize, that is, to boil down multiple variant spellings to a distinguished canonical spelling. Begin by optionally normalizing, then sort, at a cost of O(N log N), and finally make a linear pass that outputs only unique names. This trivial pre-processing step reduces N, which helps a lot when dealing with the O(N^2) quadratic cost.
Define a distance metric which accepts two names. When given a pair of identical names it must report a distance of zero; otherwise it deterministically reports a positive real number. You might use Levenshtein, or MRA.
Use nested loops to compare all names against all names. If the distance between two names is less than a threshold, arbitrarily declare name1 the winner, overwriting the 2nd name with that 1st value. The effect is to cluster multiple variant spellings down to a single winning spelling. Cost is O(N^2), quadratic.
Perhaps you're willing to tinker with the distance function a bit. You might give the initial letter(s) a heavier weight, such that a mismatched prefix guarantees the distance will exceed the threshold. In that case sorted names will help, and the nested loop can be confined to a small window of similarly prefixed names, with early termination once the prefix changes. Noting the distance between adjacent sorted names can also help with manually choosing a sensible threshold parameter.
Finally, with adjusted names in hand, you're in a position to .merge() using exact equality tests.
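Putting the pieces together, a rough sketch might look like the following. It assumes a distance() function such as Levenshtein.distance from the python-Levenshtein package, a hand-tuned threshold, and the dataframe names from the question:

from Levenshtein import distance   # any metric returning 0 for identical names will do

THRESHOLD = 3   # tune by inspecting distances between adjacent sorted names

def canonicalize(names):
    # cluster variant spellings: the first name (in sorted order) wins
    names = sorted(set(names))
    canon = {name: name for name in names}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if distance(a, b) < THRESHOLD:
                canon[b] = canon[a]
    return canon

canon = canonicalize(list(player_stats["player"]) + list(season_gw1["player"]))
player_stats["player"] = player_stats["player"].map(canon)
season_gw1["player"] = season_gw1["player"].map(canon)
season_gw1_stats = season_gw1.merge(player_stats, on="player")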
I'm writing a program that compares a smaller list of game titles to a master list of many games, to see which games in the smaller list match the titles in the master list more closely than others. To do this, I've been checking the Levenshtein ratio (in percent form) between each game in the smaller list and every game in the master list, and taking the maximum of all of these values (the lower the maximum percentage, the more unique the game has to be), using both the difflib and the fuzzywuzzy modules. The problem I'm having is that a typical search using either process.extractOne() or difflib.get_close_matches() takes about 5+ seconds per game (with 38000+ strings in the master list), and I have about 4500 games to search through (5 seconds x 4500 games is about 6 hours and 15 minutes, which I don't have time for).
In hopes of finding a better and faster method of searching through a list of strings, I'm asking what the fastest way in Python is to find the highest Levenshtein ratio between a string and a list of strings. If there is no better way than using the two functions above or writing some other looping code, then please say so.
The two functions I used specifically to search for the highest ratio are these:
metric = process.extractOne(name, master_names)[1] / 100
metric = fuzz.ratio(name, difflib.get_close_matches(name, master_names, 1, 0)[0]) / 100
Through experimentation and further research I discovered that the fastest method of checking the Levenshtein ratio is through the python-Levenshtein library itself. The function Levenshtein.ratio() is significantly faster (for one game the entire search takes only 0.05 seconds on average) compared to using any function in fuzzywuzzy or difflib, likely because of its simplicity and C implementation. I used this function in a for loop iterating over every name in the master list to get the best answer:
from Levenshtein import ratio

metric = 0
for master_name in master_names:
    new_metric = ratio(name, master_name)
    if new_metric > metric:
        metric = new_metric
In conclusion, the fastest method of finding the highest Levenshtein ratio between a string and a list of strings is to iterate over the list of strings, use Levenshtein.ratio() to compare each one with the query string, and keep the highest ratio seen on each iteration.
I'd like to find a method (e.g. in Python) which, given a sorted list, picks the top element with some error epsilon.
One way would be to pick the top element with probability p < 1, then the 2nd with p' < p, and so on with an exponential decay.
Ideally, though, I'd like a method that takes into account the winning margin of the top element, with some noise. That is:
Given a list [a,b,c,d,e,...] in which a is the largest element, b the second largest, and so on,
pick the top element with probability p < 1, where p depends on the value of a-b, p' on the value of b-c, and so on.
You can't do exactly that, since if you have n elements you will only have n-1 differences between consecutive elements. The standard method of doing something similar is fitness proportionate selection (the link provides code in Java and Ruby, which should be fairly easy to translate to other languages).
For other variants of the idea, look up selection operators for genetic algorithms (there are various).
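For reference, a minimal roulette-wheel (fitness proportionate) selection sketch in Python, assuming non-negative scores, looks like this:

import random

def roulette_select(items, scores):
    # each item is picked with probability proportional to its score
    total = sum(scores)
    r = random.uniform(0, total)
    running = 0.0
    for item, score in zip(items, scores):
        running += score
        if r <= running:
            return item
    return items[-1]   # guard against floating-point round-off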
One way to do that is to select element k with probability proportional to exp(-(x[k] - x[0])/T) where x[0] is the least element and T is a free parameter, analogous to temperature. This is inspired by an analogy to thermodynamics, in which low-energy (small x[k]) states are more probable, and high-energy (large x[k]) states are possible, but less probable; the effect of temperature is to focus on just the most probable states (T near zero) or to select from all the elements with nearly equal probability (large T).
The method of simulated annealing is based on this analogy, perhaps you can get some inspiration from that.
EDIT: Note that this method gives nearly-equal probability to elements which have nearly-equal values; from your description, it sounds like that's something you want.
SECOND EDIT: I had it backwards; what I wrote above makes lesser values more probable. Probability proportional to exp(-(x[n-1] - x[k])/T), where x[n-1] is the greatest value, makes greater values more probable instead.
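A small sketch of the corrected scheme (my own code; T is the temperature parameter):

import math
import random

def boltzmann_pick(values, T=1.0):
    # greater values get higher probability; T near zero almost always picks the
    # maximum, while a large T approaches uniform selection
    top = max(values)
    weights = [math.exp(-(top - v) / T) for v in values]
    index = random.choices(range(len(values)), weights=weights, k=1)[0]
    return values[index]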
I'm constantly adding/removing tuples to a list in Python and am interested in the weighted average (not the list itself). Since this part is computationally quite expensive compared to the rest, I want to optimise it. What's the best way of keeping track of the weighted average? I can think of two methods:
keeping the list and calculating the weighted average every time it gets accessed/changed (my current approach)
just keeping track of the current weighted average and the sum of all weights, and updating both for every add/remove action
I would prefer the 2nd option, but I am worried about floating-point errors induced by constant addition/subtraction. What's the best way of dealing with this?
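For reference, option 2 amounts to something like this sketch (keeping the weighted sum rather than the average itself, which is equivalent and needs only two running totals):

class RunningWeightedAverage:
    def __init__(self):
        self.weight_sum = 0.0
        self.weighted_sum = 0.0        # sum of weight * value

    def add(self, value, weight):
        self.weight_sum += weight
        self.weighted_sum += weight * value

    def remove(self, value, weight):
        self.weight_sum -= weight
        self.weighted_sum -= weight * value

    def average(self):
        return self.weighted_sum / self.weight_sum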
Try doing it in integers? Python bignums should make a rational argument for rational numbers (sorry, it's late... really sorry actually).
It really depends on how many terms you are using and what your weighting coefficient is as to whether you will experience much floating-point drift. You only get 53 bits of precision, and you might not need that much.
If your weighting factor is less than 1, then your error should be bounded, since you are constantly decreasing it. Let's say your weight is 0.6 (horrible, because you cannot represent it exactly in binary). That is 0.1001100110011... in binary, which has to be rounded off when stored in finite precision. Any error you introduce from that rounding will then be decreased each time you multiply again, so the error in the most recent term will dominate.
Don't do the final division until you need to. Once again, given 0.6 as your weight and 10 terms, your term weights will be 99.22903012752124 for the first term, all the way down to 1 for the last term (0.6**-t). Multiply your new term by 99.22..., add it to your running sum and subtract the trailing term out, then divide by 246.5725753188031 (that is, sum(0.6**-x for x in range(0, 10))).
If you really want to adjust for that, you can add a ULP to the term you are about to remove, but this will just underestimate intentionally, I think.
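If you do go the integer/rational route suggested at the top, Python's fractions module keeps the two running totals exact, at some speed cost; a minimal sketch:

from fractions import Fraction

weight_sum = Fraction(0)
weighted_sum = Fraction(0)

# add a term; constructing Fraction from a string keeps 0.6 exact
w, x = Fraction("0.6"), Fraction("3.14")
weight_sum += w
weighted_sum += w * x

# ... removing the same term later cancels exactly, so no drift accumulates
weight_sum -= w
weighted_sum -= w * x

# when weight_sum is non-zero: average = weighted_sum / weight_sum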
Here is an answer that retains floating point for keeping a running total (I think a weighted average requires only two running totals):
Allocate an array to store your numbers in, so that inserting a number means finding an empty space in the array and setting it to that value, and deleting a number means setting its value in the array to zero and declaring that space empty. You can use a linked list of free entries to find empty entries in O(1) time.
Now you need to work out the sum of an array of size N. Treat the array as a full binary tree, as in heapsort, so offset 0 is the root, 1 and 2 are its children, 3 and 4 are the children of 1, 5 and 6 are the children of 2, and so on: the children of i are at 2i+1 and 2i+2.
For each internal node, keep the sum of all entries at or below that node in the tree. Now when you modify an entry, you can recalculate the sum of the values in the array by working your way from that entry up to the root of the tree, correcting the partial sums as you go. This costs you O(log N), where N is the length of the array.
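A minimal sketch of that tree of partial sums (my own naming; a fixed-capacity array where an empty slot simply holds 0.0):

class PartialSumTree:
    def __init__(self, capacity):
        self.capacity = capacity
        self.values = [0.0] * capacity   # the stored numbers
        self.sums = [0.0] * capacity     # sums[i] = values[i] + sums of i's children

    def set(self, i, value):
        self.values[i] = value
        # walk from the entry up to the root, recomputing each partial sum from its
        # children so that rounding errors do not accumulate across updates: O(log N)
        while True:
            left, right = 2 * i + 1, 2 * i + 2
            self.sums[i] = self.values[i]
            if left < self.capacity:
                self.sums[i] += self.sums[left]
            if right < self.capacity:
                self.sums[i] += self.sums[right]
            if i == 0:
                break
            i = (i - 1) // 2

    def total(self):
        return self.sums[0]              # sum of every entry in the array

Keeping one such tree for the weights and another for weight * value then gives the two running totals needed for the weighted average.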
I am trying to teach myself Dynamic Programming, and ran into this problem from MIT.
We are given a checkerboard which has 4 rows and n columns, and
has an integer written in each square. We are also given a set of 2n pebbles, and we want to
place some or all of these on the checkerboard (each pebble can be placed on exactly one square)
so as to maximize the sum of the integers in the squares that are covered by pebbles. There is
one constraint: for a placement of pebbles to be legal, no two of them can be on horizontally or
vertically adjacent squares (diagonal adjacency is ok).
(a) Determine the number of legal patterns that can occur in any column (in isolation, ignoring
the pebbles in adjacent columns) and describe these patterns.
Call two patterns compatible if they can be placed on adjacent columns to form a legal placement.
Let us consider subproblems consisting of the first k columns, where 1 ≤ k ≤ n. Each subproblem can
be assigned a type, which is the pattern occurring in the last column.
(b) Using the notions of compatibility and type, give an O(n)-time dynamic programming algorithm for computing an optimal placement.
OK, so for part (a): there are 8 possible patterns per column.
For part (b), I'm unsure, but this is where I'm headed:
Split into sub-problems over the columns, with 1 ≤ i ≤ n.
1. Define Cj[i] to be the optimal value obtained by pebbling columns 0,...,i such that column i has pattern type j.
2. Create 8 separate arrays of n elements, one for each pattern type.
I am not sure where to go from here. I realize there are solutions to this problem online, but the solutions don't seem very clear to me.
You're on the right track. As you examine each new column, you will end up computing all possible best-scores up to that point.
Let's say you built your compatibility list (a 2D array) and called it Li[y], such that for each pattern i there are one or more compatible patterns Li[y].
Now you examine column j. First, compute that column's isolated score for each pattern i; call it Sj[i]. For each pattern i and each compatible pattern x = Li[y], you want to maximize the total score Cj[x], i.e. set Cj[x] = Cj-1[i] + Sj[x] whenever that is larger. This is a simple array test and update (if bigger).
In addition, you store the pebbling pattern that led to each score. When you update Cj[x] (i.e. you increase its score from its present value), remember the preceding pattern that caused the update as Pj[x] = i. That says "pattern x gave the best result, given the preceding pattern i".
When you are all done, just find the pattern i with the best score Cn[i]. You can then backtrack using Pj to recover the pebbling pattern for each column that led to this result.
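A compact Python sketch of this DP (my own code; board is assumed to be a list of n columns, each a list of 4 integers):

def max_pebbling(board):
    n = len(board)
    # part (a): the 8 legal column patterns are the 4-bit masks with no two
    # adjacent bits set (bit r set means a pebble in row r)
    patterns = [m for m in range(16) if m & (m << 1) == 0]
    # two patterns are compatible if they share no row (no horizontal adjacency)
    compatible = {p: [q for q in patterns if p & q == 0] for p in patterns}

    def column_score(j, mask):
        return sum(board[j][r] for r in range(4) if mask & (1 << r))

    # C[p] = best score over columns 0..j when column j uses pattern p
    C = {p: column_score(0, p) for p in patterns}
    parent = [dict() for _ in range(n)]    # parent[j][p] = pattern chosen for column j-1
    for j in range(1, n):
        newC = {}
        for p in patterns:
            best_prev = max(compatible[p], key=lambda q: C[q])
            newC[p] = C[best_prev] + column_score(j, p)
            parent[j][p] = best_prev
        C = newC

    # recover the pattern used in each column by backtracking
    best = max(patterns, key=lambda p: C[p])
    chosen = [best]
    for j in range(n - 1, 0, -1):
        chosen.append(parent[j][chosen[-1]])
    chosen.reverse()
    return C[best], chosen

Since there are only 8 patterns per column, the work per column is constant, which gives the required O(n) running time.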