How to choose the proper maximum value for Damerau-Levenshtein distance? - python

I am using the Damerau-Levenshtein code available from here in my similarity measurements. The problem is that when I apply Damerau-Levenshtein to two strings such as cat sat on a mat and dog sat mat, I get an edit distance of 8. The distance can be any non-negative integer, depending on the number of insertions, deletions and substitutions: 0, 1, 2, and so on. Now I am wondering whether there is a way to determine a maximum for this distance so that it can be converted to a value between 0 and 1, or how I can set the max value so that I can at least say: distance = 1 - similarity.
The reason for this post is that I am setting a threshold for a few distance metrics such as cosine, Levenshtein and Damerau-Levenshtein, and the outputs of all of them should be between zero and 1.

Levenshtein distance score = number of insertions + number of deletions + number of substitutions.
Each character of the longer string accounts for at most one of these operations, so the maximum value is the length of the longest string in your data set.

The difficult thing is that the Damerau-Levenshtein distance has no fixed upper bound: given arbitrarily long strings, it can be arbitrarily large.
If you wanted to be completely safe, you could map the range from 0 to the maximum possible length of a string onto the range 0 to 1, but the maximum length of a string depends on how much memory you have (on a 64-bit build), so I'd recommend not doing that.
Practically, you can just check all of the strings you are about to compare and take the length of the longest string in that list as the max value. Another option is to compute all of the scores first and apply the conversion factor once you know the max score. Some code that could do that:
def adjustScore(lists, maxNum):
    scaleFactor = 1 / maxNum
    return [x * scaleFactor for x in lists]

# damerau_levenshtein_distance is whichever implementation you are already using
# (e.g. the jellyfish or pyxDamerauLevenshtein packages provide one)
testWords = ["test1", "testing2", "you", "must", "construct", "additional", "plyometrics"]
testScores = []
for i in range(len(testWords) - 1):
    testScores.append(damerau_levenshtein_distance(testWords[i], testWords[i + 1]))

# method 1: just check the biggest score you got to obtain the max
max1 = max(testScores)
result = adjustScore(testScores, max1)

# method 2: if you need the adjusted score first, pick the longest string's length as max
lens = map(len, testWords)
max2 = max(lens)
result2 = adjustScore(testScores, max2)
These happen to give identical answers here because most of the words are quite different from each other, but either approach should work for most cases.
Long story short, the max distance between two strings is the length of the longer string.
Notes: if this maps in the wrong direction (i.e. high scores come out low and vice versa), change the list comprehension in adjustScore to [1 - x * scaleFactor for x in lists].
Also, if you want it to map to a different range, replace the 1 in scaleFactor with a different max value.
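For completeness, a per-pair version of the same normalization (a small sketch; dl_distance stands in for whichever Damerau-Levenshtein implementation you are already using):

def normalized_similarity(s1, s2, dl_distance):
    # the edit distance can never exceed the length of the longer string
    max_dist = max(len(s1), len(s2))
    if max_dist == 0:
        return 1.0                      # two empty strings are identical
    return 1.0 - dl_distance(s1, s2) / max_dist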

Related

Increment through list and find max value for a range of slices

I have a problem in which I need to find the max values of a range of slices.
Ex: I have a list of 5,000 ints, and want to find the maximum mean for each slice length from 1 to 3600 elements.
Currently my code is as follows:
import statistics

power_vals = ...  # some list / array of ints
max_vals = []
for i in range(1, 3600):
    max_vals += [max([statistics.mean(power_vals[ix:ix + i])
                      for ix in range(len(power_vals)) if ix + i < len(power_vals)])]
This works fine but it's really slow (for obvious reasons). I tried to use cython to speed up the process. It's obviously better but still not ideal.
Is there a more time efficient way to do this?
Your first step is to prepend a 0 to the array and then create a cumulative sum. At that point, calculating the mean from any point to any other point is two subtractions followed by a division:
mean(x[i:j]) = (cumsum[j] - cumsum[i])/(j - i)
If you're trying to find the largest mean of, say, length 10, then you can make it even faster by just looking for the largest value of (cumsum[i + 10] - cumsum[i]). Once you've found that largest value, you can then divide it by 10 to get the mean.
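As a sketch of that idea with numpy (power_vals here is stand-in random data; the loop over start positions becomes one vectorized subtraction per window length):

import numpy as np

power_vals = np.random.randint(0, 1000, size=5000)     # stand-in for the real data
cumsum = np.concatenate(([0], np.cumsum(power_vals)))  # prepend a 0, then cumulative sum

max_vals = []
for i in range(1, 3600):
    window_sums = cumsum[i:] - cumsum[:-i]              # sums of every length-i slice
    max_vals.append(window_sums.max() / i)              # largest sum divided by i = largest mean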

How to find the average value in list without considering dominant factors in python?

I'm trying to find the average value in a list, but sometimes my list contains large numbers that distort the average.
list = [215,255,210,205,450,315,235,250,450,250,250,250,210,450,210]
It would make sense for the average to fall in the range 210 to 250, but numbers such as 450 and 315 pull the average up. How do I automatically remove dominant factors like the number 450 and easily find a representative average?
The dominant factors you are talking about are called 'outliers': values in the data that are abnormal (very high or very low) compared with the rest of the dataset. You can use the concept of a z-score to remove these outliers from your data:
from scipy.stats import zscore
list1 = [215,255,210,205,450,315,235,250,450,250,250,250,210,450,210]
score=zscore(list1)
threshold=1 #should be 3 generally
list1 = [value for index,value in enumerate(list1) if abs(score[index])<=threshold ]
You can change the threshold as you wish and inspect the resulting list1 to decide on a suitable value (try a few values in the range 0-3). For more on z-scores, see: outlier detection
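Once the outliers are dropped, the ordinary mean of what remains answers the original question, for example:

import statistics
print(statistics.mean(list1))   # average of the values kept by the z-score filter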

Collapsing set of strings based on a given hamming distance

Given a set of strings (first column) along with counts (second column), e.g.:
aaaa 10
aaab 5
abbb 3
cbbb 2
dbbb 1
cccc 8
Are there any algorithms or even implementations (ideally as a Unix executable, R or python) which collapse this set into a new set based on a given hamming distance?
Collapsing implies adding the counts.
Strings with a lower count are collapsed into strings with higher counts.
For example say for hamming distance 1, the above set would collapse the second string aaab into aaaa since they are 1 hamming distance apart and aaaa has a higher count.
The collapsed entry would have the combined count, here aaaa 15
For this set we would therefore get the following collapsed set:
aaaa 15
abbb 6
cccc 8
Ideally, an implementation should be efficient, so even heuristics which do not guarantee an optimal solution would be appreciated.
Further background and motivation
Calculating the hamming distance between two strings (a pair) has been implemented in most programming languages. A brute force solution would compute the distance between all pairs. Maybe there is no way around it. However, I'd imagine an efficient solution would avoid calculating the distance for all pairs. There may be clever ways to save some calculations based on metric theory (since hamming distance is a metric), e.g. if the hamming distance between x and z is 3, and between x and y is 3, I can avoid calculating it between y and z. Maybe there is a clever k-mer approach, or an efficient solution for a constant distance (say d=1).
Even if there were only a brute force solution, I'd be curious whether this has been implemented before and how to use it (ideally without having to implement it myself).
I thought up the following:
This reports the item with the highest score together with the sum of its score and the scores of its nearby neighbors. Once a neighbor is used, it is not reported separately.
I suggest using a Vantage-point tree as the metric index.
The algorithm would look like this:
1. construct the metric index from the strings and their scores
2. construct the max heap from the strings and their scores
3. for the string with the highest score in the max heap:
4. use the metric index to find the nearby strings
5. print the string, and the sum of its score and its nearby strings' scores
6. remove from the metric index the string and each of the nearby strings
7. remove from the max heap the string and each of the nearby strings
8. repeat steps 3-7 until the max heap is empty
Perhaps this could be simplified by using a "used" table rather than removing anything. Then the metric index would not need efficient deletion, nor would the max heap need to support deletion by value. But this would be slower if the neighborhoods are large and overlap frequently, so efficient deletion might be a necessary difficulty.
1. construct the metric index from the strings and their scores
2. construct the max heap from the strings and their scores
3. construct the used table as an empty set
4. for the string with the highest score in the max heap:
5. if this string is in the used table: start over with the next string
6. use the metric index to find the nearby strings
7. remove any nearby strings that are in the used table
8. print the string, and the sum of its score and its nearby strings' scores
9. add the nearby strings to the used table
10. repeat steps 4-9 until the max heap is empty
I can not provide a complexity analysis.
I was thinking further about the second algorithm. The part I thought was slow was checking the neighborhood against the used table. This is not needed, as deletion from a vantage-point tree can be done in linear time: when searching for the neighbors, remember where they were found and remove them later using those locations. If a neighbor is used as a vantage point, mark it as removed so that a search will not return it, but otherwise leave it in place. I think this restores the runtime to below quadratic; otherwise it would be roughly the number of items times the size of a neighborhood.
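For reference, here is a brute-force sketch (my own, without the metric index) of the collapse itself; it compares every pair, so it is quadratic, but it reproduces the example from the question:

def hamming(a, b):
    """Hamming distance between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def collapse(counts, d=1):
    """Greedily take the unused string with the highest count and absorb every
    unused string within hamming distance d of it, summing the counts."""
    used = set()
    result = {}
    for s, c in sorted(counts.items(), key=lambda kv: -kv[1]):
        if s in used:
            continue
        used.add(s)
        total = c
        for t, ct in counts.items():
            if t not in used and len(t) == len(s) and hamming(s, t) <= d:
                used.add(t)
                total += ct
        result[s] = total
    return result

counts = {"aaaa": 10, "aaab": 5, "abbb": 3, "cbbb": 2, "dbbb": 1, "cccc": 8}
print(collapse(counts, d=1))    # {'aaaa': 15, 'cccc': 8, 'abbb': 6}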
In response to the comment: the problem statement was "Strings with a lower count are collapsed into strings with higher counts", and this does compute that. It is not a greedy approximation that could produce a non-optimal result, because there is nothing to maximize or minimize; it is an exact algorithm. It returns the item with the highest score combined with the scores of its neighborhood.
This can be viewed as assigning a leader to each neighborhood such that each item has at most one leader, and that leader has the largest overall score so far. The result can be viewed as a directed graph.
The specification wasn't for a dynamic programming or optimization problem. For that, you would ask for the item with the highest score in the highest total-scoring neighborhood. That can be solved in a similar way by changing the ranking function for a string from its score alone to the pair (sum of its score and its neighborhood's scores, its score).
It does mean that it can't be solved with a max heap over the scores, since removing items affects neighboring neighborhoods, and one would have to recalculate their neighborhood scores before again finding the item with the highest total-scoring neighborhood.

Get the number of zeros and ones of a binary number in Python

I am trying to solve a binary puzzle. My strategy is to transform the grid into zeros and ones, and I want to make sure that every row has the same number of 0s and 1s.
Is there any way to count how many 1s and 0s a number has without iterating through the number?
What I am currently doing is:
def binary(num, length=4):
    return format(num, '#0{}b'.format(length + 2)).replace('0b', '')

n = binary(112, 8)
# '01110000'
and then
n.count('0')
n.count('1')
Is there any more efficient computational (or maths way) of doing that?
What you're looking for is the Hamming weight of a number. In a lower-level language, you'd probably use a nifty SIMD within a register trick or a library function to compute this. In Python, the shortest and most efficient way is to just turn it into a binary string and count the '1's:
def ones(num):
    # Note that bin is a built-in
    return bin(num).count('1')
You can get the number of zeros by subtracting ones(num) from the total number of digits.
def zeros(num, length):
    return length - ones(num)
Demonstration:
>>> bin(17)
'0b10001'
>>> # leading 0b doesn't affect the number of 1s
>>> ones(17)
2
>>> zeros(17, length=6)
4
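On Python 3.10 and later there is also int.bit_count(), which counts the set bits without building a string:
>>> (17).bit_count()   # Python 3.10+, same result as ones(17)
2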
If the length is moderate (say less than 20), you can use a list as a lookup table.
It's only worth generating the list if you're doing a lot of lookups, but it seems you might in this case.
e.g. for a 16-bit table of the 0 count and the 1 count, use this:
zeros = [format(n, '016b').count('0') for n in range(1<<16)]
ones = [format(n, '016b').count('1') for n in range(1<<16)]
20 bits still takes under a second to generate on this computer
Edit: this seems slightly faster:
zeros = [20 - bin(n).count('1') for n in range(1<<20)]
ones = [bin(n).count('1') for n in range(1<<20)]

Weighted random selection with and without replacement

Recently I needed to do weighted random selection of elements from a list, both with and without replacement. While there are well known and good algorithms for unweighted selection, and some for weighted selection without replacement (such as modifications of the reservoir algorithm), I couldn't find any good algorithms for weighted selection with replacement. I also wanted to avoid the reservoir method, as I was selecting a significant fraction of the list, which is small enough to hold in memory.
Does anyone have any suggestions on the best approach in this situation? I have my own solutions, but I'm hoping to find something more efficient, simpler, or both.
One of the fastest ways to draw many samples with replacement from an unchanging list is the alias method. The core intuition is that we can create a set of equal-sized bins for the weighted list that can be indexed very efficiently through bit operations, avoiding a binary search. It will turn out that, done correctly, we only need to store two items from the original list per bin, and thus can represent the split with a single percentage.
Let us take the example of five equally weighted choices, (a:1, b:1, c:1, d:1, e:1).
To create the alias lookup:
1. Normalize the weights so that they sum to 1.0. (a:0.2, b:0.2, c:0.2, d:0.2, e:0.2) This is the probability of choosing each item.
2. Find the smallest power of 2 greater than or equal to the number of variables, and create that many partitions, |p|. Each partition represents a probability mass of 1/|p|. In this case, we create 8 partitions, each able to contain 0.125.
3. Take the variable with the least remaining weight and place as much of its mass as possible in an empty partition. In this example, we see that a fills the first partition: (p1{a|null,1.0},p2,p3,p4,p5,p6,p7,p8) with (a:0.075, b:0.2, c:0.2, d:0.2, e:0.2).
4. If the partition is not filled, take the variable with the most weight and fill the partition with that variable.
5. Repeat steps 3 and 4 until all of the original weight has been assigned to a partition.
For example, if we run another iteration of 3 and 4, we see
(p1{a|null,1.0},p2{a|b,0.6},p3,p4,p5,p6,p7,p8) with (a:0, b:0.15, c:0.2, d:0.2, e:0.2) left to be assigned.
At runtime:
1. Get a U(0,1) random number, say binary 0.001100000.
2. Bit-shift it by lg2(|p|) places to find the partition index. Thus, we shift it by 3, yielding 001.1, or position 1, and thus partition 2.
3. If the partition is split, use the fractional portion of the shifted random number to decide the split. In this case, the value is 0.5, and 0.5 < 0.6, so return a.
Here is some code and another explanation, but unfortunately it doesn't use the bitshifting technique, nor have I actually verified it.
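For what it's worth, here is a rough sketch (mine, not verified against the linked code) of the construction and lookup described above. It pads the item list with zero-weight dummies up to the next power of two, pairs under-full and over-full partitions in the usual small/large fashion, and writes the "bit shift" as a multiplication by |p|, since a Python float is not a fixed-point number:

import random

def build_alias_pow2(weights):
    """Return (npart, table); table[k] = (primary, alias, split) for partition k."""
    n = len(weights)
    npart = 1
    while npart < n:                        # smallest power of 2 >= n
        npart <<= 1
    total = float(sum(weights))
    # each item's probability mass, measured in units of one partition (1/npart)
    units = [w / total * npart for w in weights] + [0.0] * (npart - n)
    small = [i for i, u in enumerate(units) if u < 1.0]
    large = [i for i, u in enumerate(units) if u >= 1.0]
    table = [None] * npart
    while small and large:
        s, l = small.pop(), large.pop()
        table[s] = (s, l, units[s])         # partition s: item s with prob units[s], else item l
        units[l] -= 1.0 - units[s]          # item l donates the rest of partition s
        (small if units[l] < 1.0 else large).append(l)
    for i in small + large:                 # leftovers (rounding) own a whole partition
        table[i] = (i, i, 1.0)
    return npart, table

def sample_alias(npart, table):
    r = random.random() * npart             # the "bit shift": integer part picks the partition
    idx = int(r)
    primary, alias, split = table[idx]
    return primary if r - idx < split else alias   # fractional part resolves the split

npart, table = build_alias_pow2([1, 1, 1, 1, 1])    # the five equal weights from the example
counts = [0] * npart
for _ in range(100000):
    counts[sample_alias(npart, table)] += 1
print(counts)    # the first five entries should each be roughly 20000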
A simple approach that hasn't been mentioned here is one proposed in Efraimidis and Spirakis. In python you could select m items from n >= m weighted items with strictly positive weights stored in weights, returning the selected indices, with:
import heapq
import math
import random
def WeightedSelectionWithoutReplacement(weights, m):
    elt = [(math.log(random.random()) / weights[i], i) for i in range(len(weights))]
    return [x[1] for x in heapq.nlargest(m, elt)]
This is very similar in structure to the first approach proposed by Nick Johnson. Unfortunately, that approach is biased in selecting the elements (see the comments on the method). Efraimidis and Spirakis proved that their approach is equivalent to random sampling without replacement in the linked paper.
Here's what I came up with for weighted selection without replacement:
import random

def WeightedSelectionWithoutReplacement(l, n):
    """Selects without replacement n random elements from a list of (weight, item) tuples."""
    l = sorted((random.random() * x[0], x[1]) for x in l)
    return l[-n:]
This is O(m log m) on the number of items in the list to be selected from. I'm fairly certain this will weight items correctly, though I haven't verified it in any formal sense.
Here's what I came up with for weighted selection with replacement:
import bisect
import random

def WeightedSelectionWithReplacement(l, n):
    """Selects with replacement n random elements from a list of (weight, item) tuples."""
    cum_weights = []
    items = []
    total_weight = 0.0
    for weight, item in l:
        total_weight += weight
        cum_weights.append(total_weight)
        items.append(item)
    return [items[bisect.bisect(cum_weights, random.random() * total_weight)]
            for _ in range(n)]
This is O(m + n log m), where m is the number of items in the input list, and n is the number of items to be selected.
I'd recommend you start by looking at section 3.4.2 of Donald Knuth's Seminumerical Algorithms.
If your arrays are large, there are more efficient algorithms in chapter 3 of Principles of Random Variate Generation by John Dagpunar. If your arrays are not terribly large or you're not concerned with squeezing out as much efficiency as possible, the simpler algorithms in Knuth are probably fine.
It is possible to do Weighted Random Selection with replacement in O(1) time, after first creating an additional O(N)-sized data structure in O(N) time. The algorithm is based on the Alias Method developed by Walker and Vose, which is well described here.
The essential idea is that each bin in a histogram would be chosen with probability 1/N by a uniform RNG. So we walk through the bins, and for any under-weighted bin, which would receive excess hits, we assign the excess to an over-weighted bin. For each bin, we store the percentage of hits which belong to it, and the partner bin for the excess. This version tracks small and large bins in place, removing the need for an additional stack. It uses the index of the partner (stored in bucket[1]) as an indicator that a bucket has already been processed.
Here is a minimal python implementation, based on the C implementation here
import random

def prep(weights):
    """Build the alias table: data[i] = [probability share kept by item i, partner index]."""
    data_sz = len(weights)
    factor = data_sz / float(sum(weights))
    # scale the weights so that the average bucket holds exactly 1.0
    data = [[w * factor, i] for i, w in enumerate(weights)]
    big = 0
    # big scans for over-full buckets (> 1.0)
    while big < data_sz and data[big][0] <= 1.0:
        big += 1
    for small, bucket in enumerate(data):
        if bucket[1] != small:          # already assigned a partner
            continue
        excess = 1.0 - bucket[0]
        while excess > 0:
            if big == data_sz:
                break
            bucket[1] = big             # excess hits on this bucket go to bucket `big`
            bucket = data[big]
            bucket[0] -= excess         # the big bucket gives up enough mass to fill the last one
            excess = 1.0 - bucket[0]
            if excess >= 0:             # the big bucket is no longer over-full
                big += 1
                while big < data_sz and data[big][0] <= 1:
                    big += 1
    return data

def sample(data):
    r = random.random() * len(data)
    idx = int(r)                        # pick a bucket uniformly
    # use the fractional part to choose between the bucket and its partner
    return data[idx][1] if r - idx > data[idx][0] else idx
Example usage:
TRIALS = 1000
weights = [20, 1.5, 9.8, 10, 15, 10, 15.5, 10, 8, .2]
samples = [0] * len(weights)
data = prep(weights)
for _ in range(int(sum(weights) * TRIALS)):
    samples[sample(data)] += 1
result = [float(s) / TRIALS for s in samples]
err = [a - b for a, b in zip(result, weights)]
print(result)
print([round(e, 5) for e in err])
print(sum([e * e for e in err]))
The following is a description of random weighted selection of an element of a
set (or multiset, if repeats are allowed), both with and without replacement in O(n) space
and O(log n) time.
It consists of implementing a binary search tree, sorted by the elements to be
selected, where each node of the tree contains:
the element itself (element)
the un-normalized weight of the element (elementweight), and
the sum of all the un-normalized weights of the left-child node and all of
its children (leftbranchweight).
the sum of all the un-normalized weights of the right-child node and all of
its children (rightbranchweight).
Then we randomly select an element from the BST by descending down the tree. A
rough description of the algorithm follows. The algorithm is given a node of
the tree. Then the values of leftbranchweight, rightbranchweight,
and elementweight of the node are summed, and the weights are divided by this
sum, resulting in the values leftbranchprobability,
rightbranchprobability, and elementprobability, respectively. Then a
random number between 0 and 1 (randomnumber) is obtained.
if the number is less than elementprobability,
remove the element from the BST as normal, updating leftbranchweight
and rightbranchweight of all the necessary nodes, and return the
element.
else if the number is less than (elementprobability + leftbranchprobability)
recurse on leftchild (run the algorithm using leftchild as node)
else
recurse on rightchild
When we finally find, using these weights, which element is to be returned, we either simply return it (with replacement) or we remove it and update relevant weights in the tree (without replacement).
DISCLAIMER: The algorithm is rough, and a treatise on the proper implementation
of a BST is not attempted here; rather, it is hoped that this answer will help
those who really need fast weighted selection without replacement (like I do).
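A minimal sketch of this idea (names and structure are my own, and the tree must already be built and balanced by the caller; instead of physically deleting a node, selection without replacement here simply zeroes its weight and subtracts it from the branch sums on the way back up, while selection with replacement would skip that bookkeeping):

import random

class Node:
    def __init__(self, element, weight, left=None, right=None):
        self.element = element
        self.elementweight = weight                 # un-normalized weight of this element
        self.left = left
        self.right = right
        # sums of the un-normalized weights in each subtree
        self.leftbranchweight = left.total() if left else 0.0
        self.rightbranchweight = right.total() if right else 0.0

    def total(self):
        return self.elementweight + self.leftbranchweight + self.rightbranchweight

def weighted_pop(node):
    """Select one element with probability proportional to its weight, then zero
    that weight so it cannot be selected again (selection without replacement).
    Returns (element, removed_weight) so ancestors can update their branch sums."""
    r = random.random() * node.total()
    if r < node.elementweight:
        element, w = node.element, node.elementweight
        node.elementweight = 0.0
        return element, w
    elif r < node.elementweight + node.leftbranchweight:
        element, w = weighted_pop(node.left)
        node.leftbranchweight -= w
        return element, w
    else:
        element, w = weighted_pop(node.right)
        node.rightbranchweight -= w
        return element, w

# Example: a tiny hand-built tree, ordered by element; do not pop more elements than it holds.
root = Node('b', 2.0, left=Node('a', 1.0), right=Node('c', 7.0))
print([weighted_pop(root)[0] for _ in range(3)])    # e.g. ['c', 'b', 'a']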
This is an old question for which numpy now offers an easy solution, so I thought I would mention it. numpy.random.choice allows the sampling to be done with or without replacement and with given weights.
Suppose you want to sample 3 elements without replacement from the list ['white','blue','black','yellow','green'] with the probability distribution [0.1, 0.2, 0.4, 0.1, 0.2]. Using the numpy.random module it is as easy as this:
import numpy.random as rnd
sampling_size = 3
domain = ['white','blue','black','yellow','green']
probs = [.1, .2, .4, .1, .2]
sample = rnd.choice(domain, size=sampling_size, replace=False, p=probs)
# in short: rnd.choice(domain, sampling_size, False, probs)
print(sample)
# Possible output: ['white' 'black' 'blue']
Setting the replace flag to True, you get sampling with replacement.
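For example, with the same setup as above:

sample_wr = rnd.choice(domain, size=sampling_size, replace=True, p=probs)
# replace=True is the default, so rnd.choice(domain, sampling_size, p=probs) also samples with replacement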
More info here:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.random.choice.html#numpy.random.choice
We faced the task of randomly selecting K validators out of N candidates once per epoch, proportionally to their stakes. But this creates the following problem:
Imagine the probabilities of each candidate:
0.1
0.1
0.8
After 1,000,000 selections of 2 out of 3 without replacement, the observed probabilities of each candidate became:
0.254315
0.256755
0.488930
Note that the original probabilities are not achievable when selecting 2 of 3 without replacement.
But we want the initial probabilities to also be the reward-distribution probabilities; otherwise small candidate pools become disproportionately profitable. So we realized that random selection with replacement would help us: randomly select at least K of N, and also store the weight (number of draws) of each validator for reward distribution:
#include <cstdlib>
#include <numeric>
#include <vector>

// likelihoods[i] is the stake of candidate i; select at least m distinct validators,
// counting in weights[i] how many times candidate i was drawn (for reward distribution).
std::vector<int> selectValidators(const std::vector<int>& likelihoods, int m,
                                  std::vector<int>& weights) {
    const int n = static_cast<int>(likelihoods.size());
    const int likelihoodsSum = std::accumulate(likelihoods.begin(), likelihoods.end(), 0);
    std::vector<int> validators;
    weights.assign(n, 0);
    int totalWeights = 0;  // total number of draws
    for (int j = 0; static_cast<int>(validators.size()) < m; j++) {
        int value = rand() % likelihoodsSum;
        for (int i = 0; i < n; i++) {
            if (value < likelihoods[i]) {
                if (weights[i] == 0) {
                    validators.push_back(i);
                }
                weights[i]++;
                totalWeights++;
                break;
            }
            value -= likelihoods[i];
        }
    }
    return validators;
}
This gives almost the original distribution of rewards over millions of samples:
0.101230
0.099113
0.799657
