I was looking up some formulas for the Hellinger distance between distributions, and I found one (in Python) in a format I've never seen before. I am confused about how it works.
from math import sqrt

def hellinger(p, q):
    """Hellinger distance between distributions"""
    return sum([(sqrt(t[0])-sqrt(t[1]))*(sqrt(t[0])-sqrt(t[1]))\
                for t in zip(p, q)]) / sqrt(2.)
I've never seen this kind of... format before. They are dividing by a for statement? I mean.. how does this even work?
I have a soft spot for distance measures, hence I made a notebook with some implementations of the Hellinger distance.
Regarding your question, the construct is called a list comprehension, and the backslash is just for line continuation.
Here is a possible listing without list comprehension:
import math

def hellinger_explicit(p, q):
    """Hellinger distance between two discrete distributions.
    Same as original version but without list comprehension
    """
    list_of_squares = []
    for p_i, q_i in zip(p, q):
        # calculate the square of the difference of the ith distribution elements
        s = (math.sqrt(p_i) - math.sqrt(q_i)) ** 2
        # append
        list_of_squares.append(s)
    # calculate sum of squares
    sosq = sum(list_of_squares)
    return sosq / math.sqrt(2)
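A quick sanity check with two made-up distributions (the example values here are mine, not from the notebook); both versions return the same number:

p = [0.1, 0.4, 0.5]
q = [0.3, 0.3, 0.4]
print(hellinger(p, q))           # ≈ 0.0469
print(hellinger_explicit(p, q))  # same value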
I'm trying to solve this task.
I wrote a function for this purpose which uses itertools.product() for the Cartesian product of the input iterables:
def probability(dice_number, sides, target):
    from itertools import product
    from decimal import Decimal
    FOUR_PLACES = Decimal('0.0001')
    total_number_of_experiment_outcomes = sides ** dice_number
    target_hits = 0
    sides_combinations = product(range(1, sides + 1), repeat=dice_number)
    for side_combination in sides_combinations:
        if sum(side_combination) == target:
            target_hits += 1
    p = Decimal(str(target_hits / total_number_of_experiment_outcomes)).quantize(FOUR_PLACES)
    return float(p)
When calling probability(2, 6, 3) the output is 0.0556, so it works fine.
But calling probability(10, 10, 50) takes a very long time (hours?) to compute; there must be a better way :)
The loop for side_combination in sides_combinations: takes too long to iterate through the huge number of sides_combinations.
Please, can you help me figure out how to speed up the calculation? I want to sleep tonight...
I guess the problem is to find the distribution of the sum of the dice. An efficient way to do that is via discrete convolution: the distribution of the sum of independent variables is the convolution of their probability mass functions (or densities, in the continuous case). Convolution is a binary but associative operation, so you can conveniently compute it just two pmfs at a time (the current distribution of the total so far, and the next one in the list). From the final result you can read off the probabilities for each possible total: the first element is the probability of the smallest possible total, and the last element is the probability of the largest possible total. In between, you can figure out which entry corresponds to the particular sum you're looking for.
The hard part of this is the convolution, so work on that first. It's just a simple summation, but it's just a little tricky to get the limits of the summation correct. My advice is to work with integers or rationals so you can do exact arithmetic.
After that you just need to construct an appropriate pmf for each input die. The input is just [1, 1, 1, ... 1] if you're using integers (you'll have to normalize eventually) or [1/n, 1/n, 1/n, ..., 1/n] if rationals, where n = number of faces. Also you'll need to label the indices of the output correctly -- again this is just a little tricky to get it right.
Convolution is a very general approach for sums of variables. It can be made even more efficient by implementing convolution via the fast Fourier transform, since FFT(conv(A, B)) = FFT(A) · FFT(B) (elementwise product). But at this point I don't think you need to worry about that.
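To make the approach concrete, here is a minimal sketch (my own illustration, not code from the answer) that builds the pmf of the total by repeatedly convolving with a single die's pmf, using integer counts so the arithmetic stays exact:

def dice_sum_probability(dice_number, sides, target):
    # counts[i] = number of ways to roll a total of (smallest_total + i)
    counts = [1] * sides                          # one die: faces 1..sides, one way each
    for _ in range(dice_number - 1):
        new_counts = [0] * (len(counts) + sides - 1)
        for i, c in enumerate(counts):            # convolve running total with one more die
            for face in range(sides):
                new_counts[i + face] += c
        counts = new_counts
    smallest_total = dice_number                  # all dice showing 1
    index = target - smallest_total
    if 0 <= index < len(counts):
        return counts[index] / sides ** dice_number
    return 0.0

print(dice_sum_probability(2, 6, 3))     # 0.0555...
print(dice_sum_probability(10, 10, 50))  # fast, no need to enumerate 10**10 outcomes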
If someone is still interested in a solution that avoids the very long iteration through all the itertools.product Cartesian products, here it is:
def probability(dice_number, sides, target):
    if dice_number == 1:
        return (1 <= target <= sides ** dice_number) / sides
    return sum([probability(dice_number - 1, sides, target - x)
                for x in range(1, sides + 1)]) / sides
But you should add caching of the probability function's results; if you don't, the calculation will also take a very, very long time.
P.S. This code is 100% not mine, I took it from the internet; I'm not smart enough to produce it myself. I hope you'll enjoy it as much as I do.
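As noted above, caching the recursive calls is what makes this fast. A minimal sketch using functools.lru_cache from the standard library (the choice of decorator is mine; any memoisation scheme would do):

from functools import lru_cache

@lru_cache(maxsize=None)
def probability(dice_number, sides, target):
    # same recursion as above, but each (dice_number, sides, target)
    # combination is computed only once
    if dice_number == 1:
        return (1 <= target <= sides ** dice_number) / sides
    return sum(probability(dice_number - 1, sides, target - x)
               for x in range(1, sides + 1)) / sides

print(probability(10, 10, 50))  # returns in well under a second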
I am looking for the most efficient way to randomly draw n elements from a list, given a list of probabilities stating the probability of each element being picked.
aList = [3,4,2,1,4,3,5,7,6,4]
MyProba = [0.1,0.1,0.2,0,0.1,0,0.2,0,0.2,0.1]
It means that at each draw, the first element (which is 3) has a probability of 0.1 of being drawn. Of course,
sum(MyProba) == 1 # always returns True
len(aList) == len(MyProba) # always returns True
Up to now I did the following:
import random

def random_pick(some_list, proba):
    x = random.uniform(0, 1)
    cumulative_proba = 0.0
    for item, item_proba in zip(some_list, proba):
        cumulative_proba += item_proba
        if x < cumulative_proba:
            break
    return item
nb_draws = 10
list_of_drawn_elements = []
for one_draw in range(nb_draws):
    list_of_drawn_elements.append(random_pick(aList, MyProba))
It works but it is terribly slow for long lists and big values of nb_draws. How can I improve the speed of this process?
Note: In the special case I am facing, nb_draws always equals the length of aList.
The general idea (as outlined by others' answers as well) is that your method is inefficient because the preprocessing (the calculation of the cumulative distribution) is done every time you draw a sample, although it would be enough to do it once before the sampling and then use the preprocessed data to do the sampling.
The preprocessing and sampling can be done efficiently with Walker's alias method. I implemented it a while ago; take a look at the source code. (Sorry for the external link, but I think it's too long to post here.) My version requires NumPy; if you don't want to use NumPy, there is a NumPy-free alternative as well (on which my version is based).
Edit: the explanation of Walker's alias method is to be found in the first link I provided. In a nutshell, imagine that you somehow managed to construct a rectangular "darts board" that is subdivided into parts such that each part corresponds to one of your original items, and the area of each part is proportional to the desired probability of selecting the corresponding element. You can then start throwing darts at random at the darts board (by generating two random numbers that specify the horizontal and vertical coordinate of where the dart ended up) and check which areas the darts hit. The items corresponding to the areas will be the items you have selected. Walker's alias method is simply a linear-time preprocessing that constructs the dart board. Drawing each element can then be done in constant time. In the end, drawing m elements out of n will have a cost of O(n) for preprocessing and O(m) for generating the samples, yielding a total complexity of O(n + m).
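For readers who just want to see the shape of the algorithm, here is a minimal pure-Python sketch of the alias method (my own illustration, not the linked NumPy implementation), applied to the question's aList and MyProba:

import random

def build_alias_table(probs):
    # O(n) preprocessing: build the "dart board" described above
    n = len(probs)
    scaled = [p * n for p in probs]
    prob, alias = [0.0] * n, [0] * n
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s], alias[s] = scaled[s], l
        scaled[l] -= 1.0 - scaled[s]
        (small if scaled[l] < 1.0 else large).append(l)
    for i in small + large:          # leftovers caused by floating-point error
        prob[i] = 1.0
    return prob, alias

def alias_draw(prob, alias):
    # O(1) per sample: pick a column, then either keep it or take its alias
    i = random.randrange(len(prob))
    return i if random.random() < prob[i] else alias[i]

prob, alias = build_alias_table(MyProba)
samples = [aList[alias_draw(prob, alias)] for _ in range(len(aList))]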
Here's my lazy method: build a list with the expected number of occurrences of each value for the desired distribution, and use random.choice() to pick a value from the list.
>>> import random
>>>
>>> values = [3, 4, 2, 1, 4, 3, 5, 7, 6, 4]
>>> probs = [0.1, 0.1, 0.2, 0, 0.1, 0, 0.2, 0, 0.2, 0.1]
>>> expected_dist = sum([[v] * int(p * 100) for v, p in zip(values, probs)], [])
>>> random.choice(expected_dist)
You might try to precalculate the cumulative probability range for each element and build a tree from these intervals. Then you will get logarithmic complexity for looking up the element corresponding to the generated probability, instead of the linear complexity you have now.
You're recalculating cumulative_proba every time you call random_pick. I suggest calculating it outside the method and using a better data structure to store it, such as a binary search tree, which will reduce the time complexity from O(n) to O(log n).
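A minimal sketch of the precompute-then-search idea from the two answers above, using the standard library's bisect for the logarithmic lookup (the helper names are mine, not from the answers):

import bisect
import itertools
import random

def make_sampler(some_list, proba):
    # one-time O(n) preprocessing: cumulative distribution
    cumulative = list(itertools.accumulate(proba))
    def pick():
        x = random.random() * cumulative[-1]
        i = bisect.bisect_right(cumulative, x)          # O(log n) lookup
        return some_list[min(i, len(some_list) - 1)]    # clamp guards against float round-off
    return pick

pick = make_sampler(aList, MyProba)
list_of_drawn_elements = [pick() for _ in range(nb_draws)]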
I am working on a homework problem for which I am supposed to write a function that interpolates sin(x) at n+1 interpolation points and compares the interpolation to the actual values of sin at those points. The problem statement asks for a function Lagrangian(x, points) that accomplishes this, although my current attempt does not use 'x' and 'points' in the loops, so I think I will have to try again (especially since my code doesn't work as is!).

My questions: why can't I access the items in the x_n array with an index, like x_n[k]? Additionally, is there a way to access only the 'x' values in the points array and loop over those for L_x? Finally, I think my 'error' definition is wrong, since it should also be an array of values. Is it necessary to make another for loop to compare each value in the 'error' array to 'max_error'?

This is my code right now (we are executing it in a GUI our professor made, so I think some of the commands, such as messages.write(), are unique to that):
def problem_6_run(problem_6_n, problem_6_m, plot, messages, **kwargs):
    n = problem_6_n.value
    m = problem_6_m.value
    messages.write('\n=== PROBLEM 6 ==========================\n')
    x_n = np.linspace(0, 2*math.pi, n+1)
    y_n = np.sin(x_n)
    points = np.column_stack((x_n, y_n))
    i = 0
    k = 1
    L_x = 1.0

    def Lagrange(x, points):
        for i in n+1:
            for k in n+1:
                return L_x = (x - x_n[k] / x_n[i] - x_n[k])
            return Lagrange = y_n[i] * L_x

    error = np.sin(x) - Lagrange
    max_error = 0
    if error > max_error:
        max_error = error
    print.messages('Maximum error = &g' % max_error)
    plot.draw_lines(n+1, np.sin(x))
    plot.draw_points(m, Lagrange)
    plots.draw_points(m, error)
Edited:
Yes, the different things ThiefMaster mentioned are part of my (non CS) professor's environment; and yes, voithos, I'm using numpy and at this point have definitely had more practice with Matlab than Python (I guess that's obvious!). n and m are values entered by the user in the GUI; n+1 is the number of interpolation points and m is the number of points you plot against later.
Pseudocode:
Given n and m
Generate x_n, a list of n+1 evenly spaced points from 0 to 2*pi
Generate y_n, a corresponding list of points for sin(x_n)
Define points, a 2D array consisting of these ordered pairs
Define Lagrange, a function of x and points
    for each value in the range n+1 (this is where I would like to use points but don't know how to access those values appropriately)
        evaluate y_n * (x - x_n[later index] / x_n[earlier index] - x_n[later index])
Calculate max error
Calculate the error of the interpolation: Lagrange - sin(x)
plot sin(x); plot Lagrange; plot error
Does that make sense?
Some suggestions:
You can access items in x_n via x_n[k] (to answer your question).
Your loops for i in n+1: and for k in n+1: won't iterate at all; you can't loop over a plain integer in Python (it raises a TypeError). You need to use for i in range(n+1) (or xrange) to get the whole list of values [0, 1, 2, ..., n].
in error = np.sin(x) - Lagrange: You haven't defined x anywhere, so this will probably result in an error. Did you mean for this to be within the Lagrange function? Also, you're subtracting a function (Lagrange) from a number np.sin(x), which isn't going to end well.
When you use the return statement in your def Lagrange you are exiting your function. So your loop will never loop more than once because you're returning out of the function. I think you might actually want to store those values instead of returning them.
Can you write some pseudocode to show what you'd like to do? e.g.:
Given a set of points `xs` and "interpolated" points `ys`:
For each point (x,y) in (xs,ys):
Calculate `sin(x)`
Calculate `sin(x)-y` being the difference between the function and y
.... etc etc
This will make the actual code easier for you to write, and easier for us to help you with (especially if you intellectually understand what you're trying to do, and the only problem is with converting that into python).
So: try to fix up some of these points in your code, and try to write some pseudocode to say what you want to do, and we'll keep helping you :)
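For reference, here is a minimal sketch (my own, not part of the original answer) of what the fixed-up interpolation might look like, following the suggestions above: range(n+1) for the loops, accumulating values instead of returning early, and evaluating the error point by point.

import numpy as np

def lagrange(x, x_n, y_n):
    # evaluate the Lagrange interpolating polynomial through (x_n, y_n) at x
    total = 0.0
    for i in range(len(x_n)):
        L_i = 1.0
        for k in range(len(x_n)):
            if k != i:
                L_i *= (x - x_n[k]) / (x_n[i] - x_n[k])
        total += y_n[i] * L_i
    return total

n, m = 6, 50
x_n = np.linspace(0, 2 * np.pi, n + 1)     # interpolation points
y_n = np.sin(x_n)

xs = np.linspace(0, 2 * np.pi, m)          # evaluation points
errors = [abs(np.sin(x) - lagrange(x, x_n, y_n)) for x in xs]
print('Maximum error = %g' % max(errors))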
I'm doing some research on ranking algorithms, and would like to, given a sorted list and some permutation of that list, calculate some distance between the two permutations. For the case of the Levenshtein distance, this corresponds to calculating the distance between a sequence and a sorted copy of that sequence. There is also, for instance, the "inversion distance", a linear-time algorithm of which is detailed here, which I am working on implementing.
Does anyone know of an existing python implementation of the inversion distance, and/or an optimization of the Levenshtein distance? I'm calculating this on a sequence of around 50,000 to 200,000 elements, so O(n^2) is far too slow, but O(n log(n)) or better should be sufficient.
Other metrics for permutation similarity would also be appreciated.
Edit for people from the future:
Based on Raymond Hettinger's response; it's not Levenshtein or inversion distance, but rather "gestalt pattern matching" :P
from difflib import SequenceMatcher
import random
ratings = [random.gauss(1200, 200) for i in range(100000)]
SequenceMatcher(None, ratings, sorted(ratings)).ratio()
runs in ~6 seconds on a terrible desktop.
Edit2: If you can coerce your sequence into a permutation of [1 .. n], then a variation of the Manhattan metric is extremely fast and has some interesting results.
manhattan = lambda l: sum(abs(a - i) for i, a in enumerate(l)) / (0.5 * len(l) ** 2)
rankings = list(range(100000))
random.shuffle(rankings)
manhattan(rankings) # ~ 0.6665, < 1 second
The normalization factor is technically an approximation; it is correct for even sized lists, but should be (0.5 * (len(l) ** 2 - 1)) for odd sized lists.
Edit3: There are several other algorithms for checking list similarity: the Kendall tau rank correlation coefficient and the Spearman rank correlation coefficient. Implementations of these are available in the SciPy library as scipy.stats.kendalltau and scipy.stats.spearmanr, and they return the rank correlation coefficient along with the associated p-value.
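A minimal usage sketch of the SciPy functions mentioned above (same random-ratings setup as earlier; both functions return a coefficient in [-1, 1] plus a p-value):

import random
from scipy.stats import kendalltau, spearmanr

ratings = [random.gauss(1200, 200) for _ in range(10000)]
tau, tau_p = kendalltau(ratings, sorted(ratings))
rho, rho_p = spearmanr(ratings, sorted(ratings))
print(tau, rho)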
Levenshtein distance is an O(n**2) algorithm, so if you want to go faster, use the alternative fast algorithm in the difflib module. The ratio method computes a measure of similarity between two sequences.
If you have to stick with Levenshtein, there is a Python recipe for it on the ASPN Python Cookbook: http://code.activestate.com/recipes/576874-levenshtein-distance/ .
Another Python script can be found at: http://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance#Python
I am playing with the following code from Programming Collective Intelligence; it is a function from the book that calculates the Euclidean distance between two movie critics.
This function sums the squared differences of the rankings in the dictionary, but the Euclidean distance in n dimensions also includes the square root of that sum.
AFAIK, since we use the same function to rank everyone, it does not matter whether we take the square root or not, but I was wondering: is there a particular reason for that?
from math import sqrt

# Returns a distance-based similarity score for person1 and person2
def sim_distance(prefs, person1, person2):
    # Get the list of shared_items
    si = {}
    for item in prefs[person1]:
        if item in prefs[person2]:
            si[item] = 1
    # if they have no ratings in common, return 0
    if len(si) == 0:
        return 0
    # Add up the squares of all the differences
    sum_of_squares = sum([pow(prefs[person1][item] - prefs[person2][item], 2)
                          for item in prefs[person1] if item in prefs[person2]])
    return 1 / (1 + sum_of_squares)
The reason the square root is not used is that it is computationally expensive, and squaring is monotonic (i.e., it preserves order) for non-negative values; so if all you're interested in is the order of the distances, the square root is unnecessary (and, as mentioned, expensive to compute).
That's correct. While the square root is necessary for a quantitatively correct result, if all you care about is distance relative to others for sorting, then taking the square root is superfluous.
To compute a Cartesian distance, first you must compute the distance-squared, then you take its square root. But computing a square root is computationally expensive. If all you're really interested in is comparing distances, it works just as well to compare the distance-squared--and it's much faster.
For every two real numbers A and B, where A and B are >= zero, it's always true that A-squared and B-squared have the same relationship as A and B:
if A < B, then A-squared < B-squared.
if A == B, then A-squared == B-squared.
if A > B, then A-squared > B-squared.
Since distances are always >= 0 this relationship means comparing distance-squared gives you the same answer as comparing distance.
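A tiny illustration of that point (the example values are made up): sorting by squared distance gives exactly the same order as sorting by the distance itself.

import math

pairs = [(0.0, 3.0), (1.0, 1.5), (2.0, 6.0)]
d2 = [(a - b) ** 2 for a, b in pairs]       # squared distances: [9.0, 0.25, 16.0]
d = [math.sqrt(x) for x in d2]              # actual distances:  [3.0, 0.5, 4.0]
assert sorted(range(3), key=d2.__getitem__) == sorted(range(3), key=d.__getitem__)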
Just for comparisons, the square root is not necessary; you would get the squared Euclidean distance instead, which preserves the ordering of distances (strictly speaking, it is not itself a metric, since it violates the triangle inequality; see http://en.wikipedia.org/wiki/Metric_%28mathematics%29 for the definition).