Euclidean Distance Python Implementation

I am playing with the following code from Programming Collective Intelligence; it is a function from the book that calculates the Euclidean distance between two movie critics.
This function sums the squared differences of the rankings in the dictionary, but Euclidean distance in n dimensions also includes the square root of that sum.
AFAIK, since we use the same function to rank everyone, it does not matter whether we take the square root or not, but I was wondering: is there a particular reason for that?
from math import sqrt

# Returns a distance-based similarity score for person1 and person2
def sim_distance(prefs, person1, person2):
    # Get the list of shared_items
    si = {}
    for item in prefs[person1]:
        if item in prefs[person2]:
            si[item] = 1
    # if they have no ratings in common, return 0
    if len(si) == 0: return 0
    # Add up the squares of all the differences
    sum_of_squares = sum([pow(prefs[person1][item] - prefs[person2][item], 2)
                          for item in prefs[person1] if item in prefs[person2]])
    return 1/(1 + sum_of_squares)
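Here is a quick usage sketch, assuming a small made-up prefs dictionary of critics and their ratings (the names and numbers below are purely illustrative):

prefs = {
    'Lisa': {'Snakes on a Plane': 3.5, 'Superman Returns': 3.0, 'The Night Listener': 3.0},
    'Gene': {'Snakes on a Plane': 3.5, 'Superman Returns': 5.0, 'The Night Listener': 3.0},
    'Toby': {'Snakes on a Plane': 4.5, 'Superman Returns': 4.0},
}

# Scores lie in (0, 1]; closer to 1 means the two critics rated their shared movies more alike.
print(sim_distance(prefs, 'Lisa', 'Gene'))  # 1/(1 + 4) = 0.2
print(sim_distance(prefs, 'Lisa', 'Toby'))  # 1/(1 + 2) = 0.333...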

The reason the square root is not used is that it is computationally expensive; the square root is monotonic (i.e., it preserves order), so if all you're interested in is the relative order of the distances, taking the square root is unnecessary (and, as mentioned, relatively expensive).

That's correct. While the square root is necessary for a quantitatively correct result, if all you care about is distance relative to others for sorting, then taking the square root is superfluous.

To compute a Cartesian (Euclidean) distance, you first compute the distance squared and then take its square root. But computing a square root is computationally expensive. If all you're really interested in is comparing distances, it works just as well to compare the squared distances, and it's much faster.
For any two real numbers A and B, where A and B are >= 0, A-squared and B-squared always have the same relationship as A and B:
if A < B, then A-squared < B-squared;
if A == B, then A-squared == B-squared;
if A > B, then A-squared > B-squared.
Since distances are always >= 0, this relationship means that comparing squared distances gives you the same answer as comparing the distances themselves.
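As a small illustration (a sketch of my own, not from the answer above): sorting made-up points by squared distance from an origin gives exactly the same order as sorting by true Euclidean distance, with no sqrt in the comparison key.

import math

origin = (0.0, 0.0)
points = [(3.0, 4.0), (1.0, 1.0), (6.0, 8.0), (0.5, 0.2)]

def dist_sq(p, q):
    # squared Euclidean distance between two 2D points
    return (p[0] - q[0])**2 + (p[1] - q[1])**2

by_distance    = sorted(points, key=lambda p: math.sqrt(dist_sq(origin, p)))
by_distance_sq = sorted(points, key=lambda p: dist_sq(origin, p))
assert by_distance == by_distance_sq  # identical ordering; the sqrt was unnecessary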

Just for intercomparisons the square root is not necessary; you simply get the squared Euclidean distance instead. Strictly speaking, the squared Euclidean distance is not a metric in the sense of http://en.wikipedia.org/wiki/Metric_%28mathematics%29 (it violates the triangle inequality), but it preserves the ordering of distances, which is all that matters for comparisons.

Related

How to iterate through the Cartesian product of ten lists (ten elements each) faster? (Probability and Dice)

I'm trying to solve this task.
I wrote a function for this purpose which uses itertools.product() for the Cartesian product of the input iterables:
def probability(dice_number, sides, target):
    from itertools import product
    from decimal import Decimal

    FOUR_PLACES = Decimal('0.0001')
    total_number_of_experiment_outcomes = sides ** dice_number
    target_hits = 0
    sides_combinations = product(range(1, sides + 1), repeat=dice_number)
    for side_combination in sides_combinations:
        if sum(side_combination) == target:
            target_hits += 1
    p = Decimal(str(target_hits / total_number_of_experiment_outcomes)).quantize(FOUR_PLACES)
    return float(p)
When calling probability(2, 6, 3) the output is 0.0556, so it works fine.
But calling probability(10, 10, 50) takes a very long time (hours?), so there must be a better way.
The loop for side_combination in sides_combinations: takes too long because it has to iterate through a huge number of combinations.
Please, can you help me find a way to speed up the calculation? I want to sleep tonight.
I guess the problem is to find the distribution of the sum of the dice. An efficient way to do that is via discrete convolution: the distribution of a sum of independent variables is the convolution of their probability mass functions (or densities, in the continuous case). Convolution is associative, so you can compute it conveniently just two pmfs at a time (the current distribution of the total so far, and the next one in the list). From the final result you can read off the probability of each possible total: the first element is the probability of the smallest possible total, the last element is the probability of the largest, and in between you can work out which index corresponds to the particular sum you're looking for.
The hard part of this is the convolution, so work on that first. It's just a simple summation, but it's a little tricky to get the limits of the summation right. My advice is to work with integers or rationals so you can do exact arithmetic.
After that you just need to construct an appropriate pmf for each input die. The input is just [1, 1, 1, ..., 1] if you're using integers (you'll have to normalize eventually), or [1/n, 1/n, ..., 1/n] if rationals, where n is the number of faces. You'll also need to label the indices of the output correctly; again, this is just a little tricky to get right.
Convolution is a very general approach to sums of variables. It can be made even more efficient by implementing the convolution via the fast Fourier transform, since FFT(conv(A, B)) = FFT(A) * FFT(B), but at this point I don't think you need to worry about that.
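To make that concrete, here is a minimal sketch of the approach (my own translation, not code from the answer; the helper names convolve and probability_conv are made up). It keeps exact integer counts, as suggested, and only divides at the very end:

def convolve(a, b):
    # discrete convolution of two lists of counts, exact integer arithmetic
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

def probability_conv(dice_number, sides, target):
    single = [1] * sides              # one way to roll each face 1..sides
    counts = single
    for _ in range(dice_number - 1):  # fold in one more die at a time
        counts = convolve(counts, single)
    # counts[k] is the number of ways to roll a total of dice_number + k
    if not (dice_number <= target <= dice_number * sides):
        return 0.0
    return counts[target - dice_number] / sides ** dice_number

print(probability_conv(2, 6, 3))     # 0.0555..., matching the question
print(probability_conv(10, 10, 50))  # finishes in milliseconds instead of hours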
If someone is still interested in a solution which avoids the very, very long iteration through all the itertools.product Cartesian products, here it is:
def probability(dice_number, sides, target):
    if dice_number == 1:
        return (1 <= target <= sides**dice_number) / sides
    return sum([probability(dice_number - 1, sides, target - x)
                for x in range(1, sides + 1)]) / sides
But you should add caching of the probability function's results; if you don't, the calculation will still take a very, very long time.
P.S. This code is 100% not mine, I took it from the internet; I'm not smart enough to produce it myself. Hope you'll enjoy it as much as I did.
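For example (a sketch, assuming Python 3; functools.lru_cache works here because the arguments are plain integers):

from functools import lru_cache

@lru_cache(maxsize=None)
def probability(dice_number, sides, target):
    if dice_number == 1:
        return (1 <= target <= sides) / sides
    return sum(probability(dice_number - 1, sides, target - x)
               for x in range(1, sides + 1)) / sides

print(round(probability(2, 6, 3), 4))  # 0.0556
print(probability(10, 10, 50))         # returns almost instantly thanks to the cache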

An algorithm to stochastically pick the top element of a set with some noise

I'd like to find a method (e.g. in Python) which, given a sorted list, picks the top element with some error epsilon.
One way would be to pick the top element with probability p < 1, then the 2nd with p' < p, and so on with an exponential decay.
Ideally, though, I'd like a method that takes into account the winning margin of the top element, with some noise. I.e.:
Given a list [a,b,c,d,e,....] in which a is the largest element, b the second largest and so on,
Pick the top element with probability p < 1, where p depends on the value of a-b, and p' on the value of b-c and so on.
You can't do exactly that, since if you have n elements you will only have n-1 differences between consecutive elements. The standard method of doing something similar is fitness proportionate selection (the linked page provides code in Java and Ruby, which should be fairly easy to translate to other languages; a Python sketch follows below).
For other variants of the idea, look up selection operators for genetic algorithms (there are various).
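Since the linked examples are in Java and Ruby, here is a minimal Python sketch of fitness proportionate (roulette wheel) selection; it assumes all scores are non-negative, and the function name is mine:

import random

def roulette_select(items, scores):
    # pick one item with probability proportional to its (non-negative) score
    total = sum(scores)
    r = random.uniform(0, total)
    cumulative = 0.0
    for item, score in zip(items, scores):
        cumulative += score
        if r <= cumulative:
            return item
    return items[-1]  # guard against floating-point round-off

# 'a' is chosen most often, but the others still get picked occasionally
print(roulette_select(['a', 'b', 'c', 'd'], [10, 5, 3, 1]))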
One way to do that is to select element k with probability proportional to exp(-(x[k] - x[0])/T) where x[0] is the least element and T is a free parameter, analogous to temperature. This is inspired by an analogy to thermodynamics, in which low-energy (small x[k]) states are more probable, and high-energy (large x[k]) states are possible, but less probable; the effect of temperature is to focus on just the most probable states (T near zero) or to select from all the elements with nearly equal probability (large T).
The method of simulated annealing is based on this analogy, perhaps you can get some inspiration from that.
EDIT: Note that this method gives nearly-equal probability to elements which have nearly-equal values; from your description, it sounds like that's something you want.
SECOND EDIT: I have it backwards; what I wrote above makes lesser values more probable. Probability proportional to exp(-(x[n - 1] - x[k])/T) where x[n - 1] is the greatest value makes greater values more probable instead.
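A small sketch of that corrected rule (assumptions: Python 3.6+ for random.choices, and the helper name boltzmann_select is mine):

import math
import random

def boltzmann_select(items, values, T=1.0):
    # probability proportional to exp(-(max(values) - value) / T):
    # larger values are more probable; a small T concentrates on the top
    # element, a large T approaches uniform selection
    vmax = max(values)
    weights = [math.exp(-(vmax - v) / T) for v in values]
    return random.choices(items, weights=weights, k=1)[0]

# how often 'a' beats 'b' depends on the margin 0.9 - 0.85 relative to T
print(boltzmann_select(['a', 'b', 'c'], [0.9, 0.85, 0.1], T=0.05))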

Python Hellinger formula explanation

I was looking up formulas for the Hellinger distance between distributions, and I found one (in Python) written in a format I've never seen before. I am confused about how it works.
from math import sqrt

def hellinger(p, q):
    """Hellinger distance between distributions"""
    return sum([(sqrt(t[0]) - sqrt(t[1])) * (sqrt(t[0]) - sqrt(t[1]))\
                for t in zip(p, q)]) / sqrt(2.)
I've never seen this kind of format before. Are they dividing by a for statement? I mean, how does this even work?
I have a soft spot for distance measures, hence I made a notebook with some implementations of the Hellinger distance.
Regarding your question, the construct is called a list comprehension, and the backslash is just for line continuation.
Here is a possible listing without the list comprehension:
import math

def hellinger_explicit(p, q):
    """Hellinger distance between two discrete distributions.
    Same as original version but without list comprehension.
    """
    list_of_squares = []
    for p_i, q_i in zip(p, q):
        # calculate the square of the difference of the ith distribution elements
        s = (math.sqrt(p_i) - math.sqrt(q_i)) ** 2
        # append
        list_of_squares.append(s)
    # calculate sum of squares
    sosq = sum(list_of_squares)
    return sosq / math.sqrt(2)
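A quick sanity check on two made-up discrete distributions (the values are purely illustrative):

p = [0.2, 0.3, 0.5]
q = [0.1, 0.4, 0.5]

print(hellinger_explicit(p, p))  # 0.0, identical distributions have zero distance
print(hellinger_explicit(p, q))  # a small positive number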

root finding in python

Edit: A big issue here is that scipy.optimize.brentq requires that the limits of the search interval have opposite sign. If you slice up your search interval into arbitrary sections and run brentq on each section as I do below and as Dan does in the comments, you wind up throwing a lot of useless ValueErrors. Is there a slick way to deal with this in Python?
Original post:
I'm repeatedly searching functions for their largest zero in python. Right now I'm using scipy.optimize.brentq to look for a root and then using a brutish search method if my initial bounds don't work:
from scipy.optimize import brentq

# function to find the largest root of f
def bigRoot(func, pars):
    try:
        root = brentq(func, 0.001, 4, pars)
    except ValueError:
        s = 0.1
        while True:
            try:
                root = brentq(func, 4 - s, 4, pars)
                break
            except ValueError:
                s += 0.1
                continue
    return root
There are two big problems with this.
First I assume that if there are multiple roots in an interval, brentq will return the largest. I did some simple testing and I've never seen it return anything except the largest root but I don't know if that's true in all cases.
The second problem is that, in the script where I'm using this function, it will always return zero in certain cases, even though the function I pass to bigRoot diverges at 0. If I change the step size of the search from 0.1 to 0.01 then it returns a constant nonzero value in those cases. I realize the details of that depend on the function I'm passing to bigRoot, but I think the problem might be with the way I'm doing the search.
The question is, what's a smarter way to look for the largest root of a function in python?
Thanks Dan; a little more info is below as requested.
The functions I'm searching are well behaved in the regions I'm interested in. An example is plotted below (code at the end of the post).
The only singular point is at 0 (the peak that goes off the top of the plot is finite) and there are either two or three roots. The largest root usually isn't greater than 1 but it never does anything like run away to infinity. The intervals between roots get smaller at the low end of the domain but they'll never become extremely small (I'd say they'll always be larger than 10^-3).
from numpy import exp as e

# this isn't the function I plotted
def V(r):
    return 27.2*(
        23.2*e(-43.8*r) +
        8.74E-9*e(-32.9*r)/r**6 -
        5.98E-6*e(-0.116*r)/r**4 +
        0.0529*( 23*e(-62.5*r) - 6.44*e(-32*r) )/r -
        29.3*e(-59.5*r)
    )

# this is the definition of the function in the plot
def f(r, b, E):
    return 1 - b**2/r**2 - V(r)/E

# the plot is of f(r, 0.1, 0.06)
Good question, but it's a math problem rather than a Python problem.
In the absence of an analytic formula for the roots of a function, there's no way to guarantee that you've found the largest root of that function, even on a given finite interval. For example, I can construct a function which oscillates between ±1 faster and faster as it approaches 1.
f(x) = sin(1/(1-x))
This would bork any numerical method that tries to find the largest root on the interval [0,1), since for any root there are always larger ones in the interval.
So you'll have to give some background about the characteristics of the functions in question in order to get any more insight into this general problem.
UPDATE: Looks like the functions are well-behaved. The brentq docs suggest there is no guarantee of finding the largest/smallest root in the interval. Try partitioning the intervals and recursively searching for smaller and larger other roots.
from scipy.optimize import brentq

# This function should recursively find ALL the roots in the interval
# and return them ordered from smallest to largest.
def find_all_roots(f, a, b, pars=(), min_window=0.01):
    try:
        one_root = brentq(f, a, b, pars)
        print("Root at %g in [%g,%g] interval" % (one_root, a, b))
    except ValueError:
        print("No root in [%g,%g] interval" % (a, b))
        return []  # No root in the interval
    if one_root - min_window > a:
        lesser_roots = find_all_roots(f, a, one_root - min_window, pars)
    else:
        lesser_roots = []
    if one_root + min_window < b:
        greater_roots = find_all_roots(f, one_root + min_window, b, pars)
    else:
        greater_roots = []
    return lesser_roots + [one_root] + greater_roots
I tried this on your function and it finds the largest root, at ~0.14.
There's something tricky about brentq, though:
print(find_all_roots(sin, 0, 10, ()))
Root at 0 in [0,10] interval
Root at 3.14159 in [0.01,10] interval
No root in [0.01,3.13159] interval
No root in [3.15159,10] interval
[0.0, 3.141592653589793]
The sin function should have roots at 0, π, 2π, 3π. But this approach is only finding the first two. I realized that the problem is right there in the docs: f(a) and f(b) must have opposite signs. It appears that all of the scipy.optimize root-finding functions have the same requirement, so partitioning the intervals arbitrarily won't work.
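One way around the opposite-sign requirement (a sketch, not a guaranteed solution: roots of even multiplicity are still missed, and the grid spacing must be finer than the closest root separation, which the question puts at roughly 10^-3) is to scan a grid for sign changes and call brentq only on subintervals that actually bracket a root:

import numpy as np
from scipy.optimize import brentq

def roots_by_sign_change(func, a, b, pars=(), n=4000):
    # evaluate func on a grid and run brentq only where the sign flips,
    # which avoids the useless ValueErrors from arbitrary partitions
    xs = np.linspace(a, b, n + 1)
    ys = np.array([func(x, *pars) for x in xs])
    roots = []
    for x0, x1, y0, y1 in zip(xs[:-1], xs[1:], ys[:-1], ys[1:]):
        if y0 == 0.0:
            roots.append(x0)
        elif y0 * y1 < 0:
            roots.append(brentq(func, x0, x1, args=pars))
    if ys[-1] == 0.0:
        roots.append(xs[-1])
    return roots

# largest root of the plotted function, searched over the same interval as bigRoot
print(max(roots_by_sign_change(f, 0.001, 4, pars=(0.1, 0.06))))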

How do I check if cartesian coordinates make up a rectangle efficiently?

The situation is as follows:
There are N arrays.
In each array (0..N-1) there are (x,y) tuples (cartesian coordinates) stored
The length of each array can be different
I want to extract the subsets of coordinates which make up a complete rectangle of size N. In other words, all the Cartesian coordinates are adjacent to each other.
Example:
findRectangles({
    {*(1,1), (3,5), (6,9)},
    {(9,4), *(2,2), (5,5)},
    {(5,1)},
    {*(1,2), (3,6)},
    {*(2,1), (3,3)}
})
yields the following:
[(1,1),(1,2),(2,1),(2,2)],
...,
...(other solutions)...
No two points can come from the same set.
I first just calculated the cartesian product, but this quickly becomes infeasible (my use-case at the moment has 18 arrays of points with each array roughly containing 10 different coordinates).
You can use hashing to great effect:
hash each point (keeping track of which list it is in)
for each pair of points (a,b) and (c,d):
    if (a,d) exists in another list, and (c,b) exists in yet another list:
        yield rectangle(...)
When I say exists, I mean do something like:
hashesToPoints = {}
for p in points:
    hashesToPoints.setdefault(hash(p), set()).add(p)

for p1 in points:
    for p2 in points:
        p3, p4 = mixCoordinates(p1, p2)
        if p3 in hashesToPoints[hash(p3)] and {{p3 doesn't share a bin with p1,p2}}:
            if p4 in hashesToPoints[hash(p4)] and {{p4 doesn't share a bin with p1,p2,p3}}:
                yield Rectangle(p1, p2)
This is O(#bins^2 * items_per_bin^2) ≈ 30,000 operations, which is downright speedy in your case of 18 arrays and 10 items per bin, and much better than the outer-product approach, which is O(items_per_bin^#bins) ≈ 10^18. =)
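Since the snippet above leaves mixCoordinates and the bin checks as placeholders, here is a self-contained sketch of the same idea (function name and details are mine, not the answer's): index every point by the arrays it appears in, then for each candidate diagonal check that the other two corners exist and that the four corners can be drawn from four distinct arrays.

from collections import defaultdict
from itertools import combinations, product

def find_rectangles(arrays):
    # map each point to the set of array indices that contain it
    point_to_arrays = defaultdict(set)
    for idx, arr in enumerate(arrays):
        for pt in arr:
            point_to_arrays[pt].add(idx)

    for (x0, y0), (x1, y1) in combinations(point_to_arrays, 2):
        if (x1 - x0) * (y1 - y0) <= 0:
            continue  # skip degenerate pairs and anti-diagonals (avoids duplicates)
        xlo, xhi = sorted((x0, x1))
        ylo, yhi = sorted((y0, y1))
        corners = [(xlo, ylo), (xlo, yhi), (xhi, ylo), (xhi, yhi)]
        if any(c not in point_to_arrays for c in corners):
            continue
        # each corner belongs to only a few arrays, so brute-force the
        # "four corners from four distinct arrays" constraint
        owners = [point_to_arrays[c] for c in corners]
        if any(len(set(choice)) == 4 for choice in product(*owners)):
            yield tuple(corners)

arrays = [
    [(1, 1), (3, 5), (6, 9)],
    [(9, 4), (2, 2), (5, 5)],
    [(5, 1)],
    [(1, 2), (3, 6)],
    [(2, 1), (3, 3)],
]
print(list(find_rectangles(arrays)))  # [((1, 1), (1, 2), (2, 1), (2, 2))]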
minor sidenote:
You can reduce both the base and the exponent in your computation by making multiple "pruning" passes, e.g.:
remove each point that does not share an X or a Y coordinate with some other point
then maybe remove each point that doesn't have such a partner in both the X and the Y direction
You can do this by sorting on the X coordinate and repeating for the Y coordinate, in O(P log(P)) time in terms of the number of points P. You may be able to do this at the same time as the hashing. If an adversary is arranging your input, this optimization may not help at all, but depending on your distribution you may see a significant speedup.
Let XY be your set of arrays. Construct two new sets X and Y, where X equals XY with all arrays sorted by x-coordinate and Y equals XY with all arrays sorted by y-coordinate.
For each point (x0,y0) in any of the arrays in X: find every point (x0,y1) with the same x-coordinate and a different y-coordinate in the remaining arrays from X
For each such pair of points (if it exists): search Y for points (x1,y0) and (x1,y1)
Let C be the size of the largest array. Then sorting all sets takes time O(N*C*log(C)). In step 1, finding a single matching point takes time O(N*log(C)) since all arrays in X are sorted. Finding all such points is in O(C*N), since there are at most C*N points overall. Step 2 takes time O(N*log(C)) since Y is sorted.
Hence, the overall asymptotic runtime is in O(C * N^2 * log(C)^2).
For C == 10 and N == 18, you'll get roughly 10,000 operations. Multiply that by 2, since I dropped that factor due to big-O notation.
The solution has the further benefit of being extremely simple to implement. All you need is arrays, sorting and binary search, the first two of which very likely being built into the language already, and binary search being extremely simple.
Also note that this is the runtime in the worst case where all rectangles start at the same x-coordinate. In the average case, you'll probably do much better than this.
