Continuous mutual information in Python


[Frontmatter] (skip this if you just want the question):
I'm currently looking at using Shannon-Weaver Mutual Information and normalized redundancy to measure the degree of information masking between bags of discrete and continuous feature values, organized by feature. Using this method, it is my goal to construct an algorithm that looks very similar to ID3, but instead of using Shannon entropy, the algorithm will seek (as a loop constraint) to maximize or minimize shared information between a single feature and a collection of features based on the complete input feature space, adding new features to the latter collection if (and only if) they increase or decrease mutual information, respectively. This, in effect, moves ID3's decision algorithm into pairspace, stapling an ensemble approach to it with all of the expected time and space complexities of both methods.
[/Frontmatter]
On to the question: I'm trying to get a continuous integrator working in Python using SciPy. Because I'm working with comparisons of discrete and continuous variables, my current strategy for each comparison for feature-feature pairs is as follows:
Discrete feature versus discrete feature: use the discrete form of mutual information. This results in a double summation of the probabilities, which my code handles without issue.
All other cases (discrete versus continuous, the inverse, and continuous versus continuous): use the continuous form, using a Gaussian estimator to smooth out the probability density functions.
It is possible for me to perform some kind of discretization for the latter cases, but because my input data sets are not inherently linear, this is potentially needlessly complex.
Here's the salient code:
import math
import numpy
import scipy
from scipy.stats import gaussian_kde
from scipy.integrate import dblquad
# Constants
MIN_DOUBLE = 4.9406564584124654e-324
# The minimum size of a Float64; used here to prevent the
# logarithmic function from hitting its undefined region
# at its asymptote of 0.
INF = float('inf') # The floating-point representation for "infinity"
# x and y are previously defined as collections of
# floating point values with the same length
# Kernel estimation
gkde_x = gaussian_kde(x)
gkde_y = gaussian_kde(y)
if len(binned_x) != len(binned_y) and len(binned_x) != len(x):
    x.append(x[0])
    y.append(y[0])
gkde_xy = gaussian_kde([x,y])
mutual_info = lambda a,b: gkde_xy([a,b]) * \
    math.log((gkde_xy([a,b]) / (gkde_x(a) * gkde_y(b))) + MIN_DOUBLE)
# Compute MI(X,Y)
(minfo_xy, err_xy) = \
    dblquad(mutual_info, -INF, INF, lambda a: 0, lambda a: INF)
print 'minfo_xy = ', minfo_xy
Note that overcounting exactly one point is done deliberately to prevent a singularity in SciPy's gaussian_kde class. As the size of x and y mutually approach infinity, this effect becomes negligible.
My current snag is in trying to get multiple integration working against a Gaussian kernel density estimate in SciPy. I've been trying to use SciPy's dblquad to perform the integration, but in the latter case, I receive an astounding spew of the following messages.
When I set numpy.seterr ( all='ignore' ):
Warning: The ocurrence of roundoff error is detected, which prevents
the requested tolerance from being achieved. The error may be
underestimated.
And when I set it to 'call' using an error handler:
Floating point error (underflow), with flag 4
Floating point error (invalid value), with flag 8
Pretty easy to figure out what's going on, right? Well, almost: IEEE 754-2008 and SciPy only tell me what's going on here, not why or how to work around it.
The upshot: minfo_xy generally resolves to nan; its sampling is insufficient to prevent information from becoming lost or invalid when performing Float64 math.
Is there a general workaround for this problem when using SciPy?
Even better: if there is a robust, canned implementation of continuous mutual information for Python with an interface that takes two collections of floating point values or a merged collection of pairs, it would resolve this complete problem. Please link it if you know of one that exists.
Thank you in advance.
Edit: this resolves the nan propagation issue in the example above:
mutual_info = lambda a,b: gkde_xy([a,b]) * \
    math.log((gkde_xy([a,b]) / ((gkde_x(a) * gkde_y(b)) + MIN_DOUBLE)) \
    + MIN_DOUBLE)
However, the question of roundoff correction remains, as does the request for a more robust implementation. Any help in either domain would be greatly appreciated.

Before trying more radical solutions like reframing the problem or using different integration tools, see if this helps. Replace INF=float('INF') with INF=1E12 or some other large number -- that may eliminate NaN results created by simple arithmetic operations on the input variables.
No promises on this one, but it is sometimes helpful to try a quick fix before engaging in a significant algorithmic rewrite or substitution of alternate tools.
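If the quadrature keeps misbehaving, one alternative is to skip dblquad entirely and average the log density ratio over the observed points (a resubstitution estimate). This is a different technique from the integral in the question, not a fix for it. The sketch below assumes paired 1-D samples, and the function name kde_mutual_information is made up for illustration. KDE-based estimates like this are biased, so treat the numbers as rough.

```python
import numpy as np
from scipy.stats import gaussian_kde


def kde_mutual_information(x, y):
    """Resubstitution estimate of MI(X, Y) in nats from paired samples."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xy = np.vstack([x, y])

    # Evaluate log-densities at the sample points themselves. Each point
    # sits on its own kernel, so the densities stay strictly positive.
    log_p_xy = gaussian_kde(xy).logpdf(xy)
    log_p_x = gaussian_kde(x).logpdf(x)
    log_p_y = gaussian_kde(y).logpdf(y)

    # Average the log-ratio over the data instead of integrating
    # over the whole plane.
    return float(np.mean(log_p_xy - log_p_x - log_p_y))


# Synthetic data for illustration only.
rng = np.random.default_rng(0)
a = rng.standard_normal(500)
b = rng.standard_normal(500)  # independent of a
c = 0.9 * a + np.sqrt(1 - 0.9 ** 2) * rng.standard_normal(500)  # correlated with a

mi_independent = kde_mutual_information(a, b)
mi_dependent = kde_mutual_information(a, c)
```

The correlated pair should come out clearly higher than the independent pair. Working with logpdf avoids the MIN_DOUBLE padding, since no density is ever divided or passed to log directly.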

Related

Binary variables for minimization by scipy differential evolution

I have a non-linear minimization problem that takes a combination of continuous and binary variables as input. Think of it as a network flow problem with valves, for which the throughput can be controlled, and with pumps, for which you can change the direction.
A "natural," minimalistic formulation could be:
arg( min( f(x1,y2,y3) )) s.t.
x1 \in [0,1] //a continuous variable
y2,y3 \in {0,1} //two binary variables
The objective function is deterministic, but expensive to solve. If I leave away the binary variables, Scipy's differential evolution algorithm turns out to be a useful solution approach for my problem (converging faster than basin hopping).
There is some evidence available already with regard to the inclusion of integer variables in a differential evolution-based minimization problem. The suggested approaches turn y2,y3 into continuous variables x2,x3 \in [0,1], and then modify the objective function as follows:
(i) f(x1, round(x2), round(x3))
(ii) f(x1,x2,x3) + K( (x2-round(x2))^2 + (x3-round(x3))^2 )
with K a tuning parameter
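To make (i) and (ii) concrete, here is a minimal sketch of both as wrappers around an objective that takes (x1, y2, y3). The wrapper names and the toy objective are hypothetical and only serve to exercise the idea.

```python
def with_rounding(f):
    """Approach (i): round the relaxed variables before calling f."""
    def wrapped(v):
        x1, x2, x3 = v
        return f(x1, round(x2), round(x3))
    return wrapped


def with_penalty(f, K):
    """Approach (ii): evaluate f on the relaxed point and penalise non-integrality."""
    def wrapped(v):
        x1, x2, x3 = v
        penalty = (x2 - round(x2)) ** 2 + (x3 - round(x3)) ** 2
        return f(x1, x2, x3) + K * penalty
    return wrapped


# Hypothetical toy objective, used only to exercise the wrappers.
def toy(x1, y2, y3):
    return (x1 - 0.3) ** 2 + (y2 - 1) ** 2 + y3 ** 2


rounded = with_rounding(toy)
penalised = with_penalty(toy, K=10.0)
```

Note that (ii) requires f to be defined for fractional x2 and x3, whereas (i) only ever calls f with 0 or 1 in those positions.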
A third, and probably naive approach would be to combine the binary variables into a single continuous variable z \in [0,1], and thereby to reduce the number of optimization variables.
For instance,
if z<0.25: y2=y3=0
elif z<0.5: y2=1, y3=0
elif z<0.75: y2=0, y3=1
else: y2=y3=1.
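Written out as a function, the mapping above could look like this (the name decode is just for illustration):

```python
def decode(z):
    """Map z in [0, 1] to the pair (y2, y3) using the four quarter-intervals."""
    if z < 0.25:
        return 0, 0
    elif z < 0.5:
        return 1, 0
    elif z < 0.75:
        return 0, 1
    return 1, 1
```

One thing this makes visible: moving z across 0.5 flips both binary variables at once, so neighbouring values of z do not always correspond to neighbouring configurations.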
Which one of the above should be preferred, and why? I'd be very curious to hear how binary variables can be integrated in a continuous differential evolution algorithm (such as Scipy's) in a smart way.
PS. I'm aware that there's some literature available that proposes dedicated mixed-integer evolutionary algorithms. For now, I'd like to stay with Scipy.

I'd be very curious to hear how binary variables can be integrated in a continuous differential evolution algorithm
wrapdisc is a package that is a thin wrapper which will let you optimize binary variables alongside floats with various scipy.optimize optimizers. There is a usage example in its readme. With it, you don't have to adapt your objective function at all.
As of v2.0.0, it has two possible encodings for binary:
ChoiceVar: This uses one-hot max encoding. Two floats are used to represent the binary variable.
GridVar: This uses rounding. One float is used to represent the binary variable.
Although neither of these two variable types was made for binary, they both can support it just the same. On average, GridVar requires fewer function evaluations because it uses one fewer float than ChoiceVar.

As of SciPy 1.9, the differential_evolution function accepts an integrality parameter that lets the user indicate which parameters should be treated as integers. For binary selection, use bounds of (0, 1) for an integer parameter.
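A minimal sketch of how that looks, assuming SciPy 1.9 or newer is installed. The objective is a made-up toy with one continuous and two binary inputs, not the network flow problem from the question.

```python
from scipy.optimize import differential_evolution


# Hypothetical toy objective: one continuous and two binary inputs.
def objective(v):
    x1, y2, y3 = v
    return (x1 - 0.3) ** 2 + (y2 - 1) ** 2 + y3 ** 2


bounds = [(0, 1), (0, 1), (0, 1)]
result = differential_evolution(
    objective,
    bounds,
    integrality=[False, True, True],  # requires SciPy >= 1.9
)
x1_opt, y2_opt, y3_opt = result.x
```

With this, the objective function does not need any rounding or penalty logic of its own.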

scikit KernelPCA unstable results

I'm trying to use KernelPCA for reducing the dimensionality of a dataset to 2D (both for visualization purposes and for further data analysis).
I experimented computing KernelPCA using a RBF kernel at various values of Gamma, but the result is unstable:
(each frame is a slightly different value of Gamma, where Gamma is varying continuously from 0 to 1)
Looks like it is not deterministic.
Is there a way to stabilize it/make it deterministic?
Code used to generate transformed data:
def pca(X, gamma1):
    kpca = KernelPCA(kernel="rbf", fit_inverse_transform=True, gamma=gamma1)
    X_kpca = kpca.fit_transform(X)
    #X_back = kpca.inverse_transform(X_kpca)
    return X_kpca

KernelPCA should be deterministic and evolve continuously with gamma. It is different from RBFSampler that does have built-in randomness in order to provide an efficient (more scalable) approximation of the RBF kernel.
However what can change in KernelPCA is the order of the principal components: in scikit-learn they are returned sorted in order of descending eigenvalue, so if you have 2 eigenvalues close to each other it could be that the order changes with gamma.
My guess (from the gif) is that this is what is happening here: the axes along which you are plotting are not constant so your data seems to jump around.
Could you provide the code you used to produce the gif?
I'm guessing it is a plot of the data points along the 2 first principal components but it would help to see how you produced it.
You could try to further inspect it by looking at the values of kpca.alphas_ (the eigenvectors) for each value of gamma.
Hope this makes sense.
EDIT: As you noted, it looks like the points are reflected about an axis. The most plausible explanation is that one of the eigenvectors flips sign (note that this does not affect the eigenvalue).
I put in a simple gist to reproduce the issue (you'll need a Jupyter notebook to run it). You can see the sign-flipping when you change the value of gamma.
As a complement, note that this kind of discrepancy happens only because you fit the KernelPCA object several times. Once you have settled on a particular gamma value and fit kpca once, you can call transform several times and get consistent results.
For the classical PCA the docs mention that:
Due to implementation subtleties of the Singular Value Decomposition (SVD), which is used in this implementation, running fit twice on the same matrix can lead to principal components with signs flipped (change in direction). For this reason, it is important to always use the same estimator object to transform data in a consistent fashion.
I don't know about the behavior of a single KernelPCA object that you would fit several times (I did not find anything relevant in the docs).
It does not apply to your case though as you have to fit the object with several gamma values.
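If the sign flips are the only problem, one possible workaround is to impose a sign convention on the transformed data after each fit. The sketch below is an assumption on my part, not something scikit-learn does for you here: it flips each column so that its largest-magnitude entry is positive. The helper name fix_signs is hypothetical.

```python
import numpy as np


def fix_signs(components):
    """Flip each column so its largest-magnitude entry is positive."""
    components = np.asarray(components, dtype=float)
    rows = np.argmax(np.abs(components), axis=0)
    signs = np.sign(components[rows, np.arange(components.shape[1])])
    signs[signs == 0] = 1.0  # leave all-zero columns unchanged
    return components * signs


# Toy 2-D embedding and the same embedding with its first axis mirrored.
embedding = np.array([[0.5, -2.0], [-3.0, 1.0], [1.0, 0.5]])
flipped = embedding * np.array([-1.0, 1.0])
```

Both fix_signs(embedding) and fix_signs(flipped) give the same array. The convention can still jump between frames if the entry with the largest magnitude changes as gamma varies, so this reduces the flipping rather than guaranteeing a smooth animation.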

So... I can't give you a definitive answer on why KernelPCA is not deterministic. The behavior resembles the differences I've observed between the results of PCA and RandomizedPCA. PCA is deterministic, but RandomizedPCA is not, and sometimes the eigenvectors are flipped in sign relative to the PCA eigenvectors.
That leads me to my vague idea of how you might get more deterministic results....maybe. Use RBFSampler with a fixed seed:
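from sklearn.decomposition import PCA
from sklearn.kernel_approximation import RBFSampler

def pca(X, gamma1):
    # Approximate the RBF feature map with a fixed seed, then run plain PCA on it
    kernvals = RBFSampler(gamma=gamma1, random_state=0).fit_transform(X)
    X_kpca = PCA().fit_transform(kernvals)
    return X_kpca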

How do you compute the expected value of an infinite distribution? (particularly in python)

I was trying to compute the expected value of a distribution (assume I know the parameters or I can estimate them) but it might be a distribution over a sample space that is infinite. Is there a library (for example in python, numpy or something) that is able to compute such an expected value with reasonable speed and accuracy?
For an arbitrary distribution this seems hard. The only idea I had was that, if the distribution were normal, one could approximate the expected value by summing small chunks over the region where the probability is concentrated. I would prefer something less ad hoc and more established, though, since I am surely not the first person to compute an expected value on a computer.

Having a probability space with infinite support is not uncommon.
The normal and t distributions have support over the whole real line; the Poisson distribution has support over all non-negative integers.
The distributions in scipy.stats implement an expect method, which in the continuous case just uses scipy.integrate.quad, and in the discrete case uses expanding summation with some heuristic stopping criterion.
This works quite well with well behaved functions but can run into problems in some cases, like shifted support of the function or fat tails.
variance of standard normal:
>>> from scipy import stats
>>> stats.norm.expect(lambda x: x**2)
1.000000000000001
variance of Poisson:
>>> stats.poisson.expect(lambda x: (x - 5)**2, args=(5,))
4.9999999999999973
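For a density that is not a scipy.stats distribution, the same idea can be applied by hand with scipy.integrate.quad, which accepts infinite limits. A rough sketch, where the helper name expectation is made up for illustration:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm


def expectation(g, pdf, lower=-np.inf, upper=np.inf):
    """Integrate g(t) * pdf(t) over [lower, upper]; returns (value, error estimate)."""
    value, abs_error = quad(lambda t: g(t) * pdf(t), lower, upper)
    return value, abs_error


# Second moment of the standard normal, which should be close to 1.
second_moment, err = expectation(lambda t: t ** 2, norm.pdf)
```

The same caveats apply as for expect: fat tails or a density concentrated far from the origin can make quad miss most of the mass.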

Maximum Likelihood Estimate pseudocode

I need to code a Maximum Likelihood Estimator to estimate the mean and variance of some toy data. I have a vector with 100 samples, created with numpy.random.randn(100). The data should have zero mean and unit variance Gaussian distribution.
I checked Wikipedia and some extra sources, but I am a little bit confused since I don't have a statistics background.
Is there any pseudo code for a maximum likelihood estimator? I get the intuition of MLE but I cannot figure out where to start coding.
Wiki says taking argmax of log-likelihood. What I understand is: I need to calculate log-likelihood by using different parameters and then I'll take the parameters which gave the maximum probability. What I don't get is: where will I find the parameters in the first place? If I randomly try different mean & variance to get a high probability, when should I stop trying?

I just came across this, and I know it's old, but I'm hoping that someone else benefits from this. Although the previous comments gave pretty good descriptions of what ML optimization is, no one gave pseudo-code to implement it. Python has a minimizer in Scipy that will do this. Here's pseudo code for a linear regression.
# import the packages
import numpy as np
from scipy.optimize import minimize
import scipy.stats as stats
# Set up your x values
x = np.linspace(0, 100, num=100)
# Set up your observed y values with a known slope (2.4), intercept (5), and sd (4)
yObs = 5 + 2.4*x + np.random.normal(0, 4, 100)
# Define the likelihood function where params is a list of initial parameter estimates
def regressLL(params):
    # Resave the initial parameter guesses
    b0 = params[0]
    b1 = params[1]
    sd = params[2]
    # Calculate the predicted values from the initial parameter guesses
    yPred = b0 + b1*x
    # Calculate the negative log-likelihood as the negative sum of the log of a normal
    # PDF where the observed values are normally distributed around the mean (yPred)
    # with a standard deviation of sd
    negLogLik = -np.sum( stats.norm.logpdf(yObs, loc=yPred, scale=sd) )
    # Tell the function to return the NLL (this is what will be minimized)
    return negLogLik
# Make a list of initial parameter guesses (b0, b1, sd)
initParams = [1, 1, 1]
# Run the minimizer
results = minimize(regressLL, initParams, method='nelder-mead')
# Print the results. They should be really close to your actual values
print(results.x)
This works great for me. Granted, this is just the basics. It doesn't profile or give CIs on the parameter estimates, but it's a start. You can also use ML techniques to find estimates for, say, ODEs and other models, as I describe here.
I know this question was old, hopefully you've figured it out since then, but hopefully someone else will benefit.

If you do maximum likelihood calculations, the first step you need to take is the following: Assume a distribution that depends on some parameters. Since you generate your data (you even know your parameters), you "tell" your program to assume Gaussian distribution. However, you don't tell your program your parameters (0 and 1), but you leave them unknown a priori and compute them afterwards.
Now, you have your sample vector (let's call it x, its elements are x[0] to x[99]) and you have to process it. To do so, you have to compute the following (f denotes the probability density function of the Gaussian distribution):
f(x[0]) * ... * f(x[99])
As you can see in my given link, f employs two parameters (the greek letters µ and σ). You now have to calculate the values for µ and σ in a way such that f(x[0]) * ... * f(x[99]) takes the maximum possible value.
When you've done that, µ is your maximum likelihood value for the mean, and σ is the maximum likelihood value for standard deviation.
Note that I don't explicitly tell you how to compute the values for µ and σ, since this is a quite mathematical procedure I don't have at hand (and probably I would not understand it); I just tell you the technique to get the values, which can be applied to any other distributions as well.
Since you want to maximize the original term, you can "simply" maximize the logarithm of the original term - this saves you from dealing with all these products, and transforms the original term into a sum with some summands.
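If you really want to calculate it, you can do some simplifications that lead to the following term (hope I didn't mess up anything):
log L(µ, σ) = -(n/2) * log(2π) - n * log(σ) - (1/(2σ²)) * [(x[0] - µ)² + ... + (x[n-1] - µ)²], with n = 100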
Now, you have to find values for µ and σ such that the above beast is maximal. Doing that is a very nontrivial task called nonlinear optimization.
One simplification you could try is the following: Fix one parameter and try to calculate the other. This saves you from dealing with two variables at the same time.

You need a numerical optimisation procedure. Not sure if anything is implemented in Python, but if it is then it'll be in numpy or scipy and friends.
Look for things like 'the Nelder-Mead algorithm', or 'BFGS'. If all else fails, use Rpy and call the R function 'optim()'.
These functions work by searching the function space and trying to work out where the maximum is. Imagine trying to find the top of a hill in fog. You might just try always heading up the steepest way. Or you could send some friends off with radios and GPS units and do a bit of surveying. Either method could lead you to a false summit, so you often need to do this a few times, starting from different points. Otherwise you may think the south summit is the highest when there's a massive north summit overshadowing it.
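For the toy problem in the question, the "start from several points" advice could look roughly like the sketch below. It uses scipy.optimize.minimize with Nelder-Mead on the Gaussian negative log-likelihood. The starting points, the seed, and the choice to optimise log(sigma) instead of sigma are my own assumptions.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
data = rng.standard_normal(100)


def neg_log_likelihood(params):
    mu, log_sigma = params
    # Optimising log(sigma) keeps sigma positive without explicit bounds.
    return -np.sum(norm.logpdf(data, loc=mu, scale=np.exp(log_sigma)))


# Start from several points and keep the best result, in case one start stalls.
starts = [(-2.0, 1.0), (0.5, 0.0), (3.0, -1.0)]
fits = [minimize(neg_log_likelihood, s, method="Nelder-Mead") for s in starts]
best = min(fits, key=lambda r: r.fun)
mu_hat, sigma_hat = best.x[0], np.exp(best.x[1])
```

For a single Gaussian the likelihood has only one peak, so all starts should agree; the multi-start pattern matters more for models with several local optima.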

As joran said, the maximum likelihood estimates for the normal distribution can be calculated analytically. The answers are found by finding the partial derivatives of the log-likelihood function with respect to the parameters, setting each to zero, and then solving both equations simultaneously.
In the case of the normal distribution, you would differentiate the log-likelihood with respect to the mean (mu) and then with respect to the variance (sigma^2) to get two equations, each set equal to zero. After solving the equations for mu and sigma^2, you'll get the sample mean and the sample variance (the version that divides by n, not n - 1) as your answers.
See the wikipedia page for more details.
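In code, the closed-form estimates are only a couple of lines. This sketch generates its own toy data with a fixed seed, which is an assumption for reproducibility; the variance divides by n, which is what the maximum likelihood solution gives.

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.standard_normal(100)

# Maximum likelihood estimates for a Gaussian: the sample mean and the
# variance computed with a divisor of n (not n - 1).
mu_hat = samples.sum() / samples.size
var_hat = ((samples - mu_hat) ** 2).sum() / samples.size
```

These match np.mean(samples) and np.var(samples), since np.var uses a divisor of n by default.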

Python: Plotting a power law function with exponential cutoff

I have a graph between 2 functions f and g.
I know it follows a power law function with exponential cutoff.
f(x) = x**(-alpha)*e**(-lambda*x)
How do I find the value of exponent alpha?

If you have sufficiently close x points (for example one every 0.1), you can try the following:
ln(f(x)) = -alpha ln(x) - lambda x
ln(f(x))' = - alpha / x - lambda
So depending on where you have your points:
If you have a lot of points near 0, you can try:
h(x) = x ln(f(x))' = -alpha - lambda x
So the limit of the function h when x goes to 0 is -alpha
If you have large values of x, the function x -> ln(f(x))' tends toward -lambda when x goes to infinity, so you can estimate lambda and use pwdyson's expression.
If you don't have close x points, the numerical derivative will be very noisy, so I would try to estimate lambda as the limit of -ln(f(x))/x for large x's...
If you don't have large values, but a large number of x's, you can try a minimization of
sum_x_i (ln(y_i) + alpha ln(x_i) + lambda x_i) ^2
on both alpha and lambda (I guess it would be more precise than the initial expression)...
It is a simple least squares regression (numpy.linalg.lstsq will do the job).
So you have plenty of methods; the one to choose really depends on your inputs.
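A minimal sketch of the least squares variant with numpy.linalg.lstsq, using synthetic noise-free data. The true values 1.5 and 0.2 are arbitrary choices for the demonstration.

```python
import numpy as np

alpha_true, lam_true = 1.5, 0.2
x = np.linspace(0.5, 20.0, 40)
y = x ** (-alpha_true) * np.exp(-lam_true * x)

# ln(y) = -alpha * ln(x) - lambda * x is linear in (alpha, lambda),
# so the two columns of the design matrix are -ln(x) and -x.
design = np.column_stack([-np.log(x), -x])
coeffs, *_ = np.linalg.lstsq(design, np.log(y), rcond=None)
alpha_hat, lam_hat = coeffs
```

With noisy data, keep in mind that taking logs changes the weighting of the errors, so points with small y count for more than they would in a fit on the original scale.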

The usual and general way of doing what you want is to perform a non-linear regression (even though, as noted in another response, it is possible to linearize the problem). Python can do this quite easily with the help of the SciPy package, which is used by many scientists.
The routine you are looking for is its least-square optimization routine (scipy.optimize.leastsq). Once you wrap your head around the way this general optimization procedure works (see the example), you will probably find many other opportunities to use it. Basically, you calculate the list of differences between your measurements and their ideal value f(x), and you ask SciPy to find the parameters that make these differences as small as possible, so that your data fits the model as well as possible. This then gives you the parameter you are looking for.
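As a rough illustration of that recipe with scipy.optimize.leastsq, the sketch below fits alpha and lambda to synthetic noise-free data. The function names, the true values and the starting guess are all assumptions made for the example.

```python
import numpy as np
from scipy.optimize import leastsq


def model(params, x):
    alpha, lam = params
    return x ** (-alpha) * np.exp(-lam * x)


def residuals(params, x, y):
    # Differences between the measurements and the model prediction.
    return y - model(params, x)


x = np.linspace(1.0, 20.0, 50)
y = model((1.5, 0.2), x)  # synthetic "measurements"

fitted, status = leastsq(residuals, x0=(1.0, 0.1), args=(x, y))
alpha_fit, lam_fit = fitted
```

On real data the result can depend on the starting guess, so it is worth trying a few and comparing the residuals.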

It sounds like you might be trying to fit a power-law to a distribution with an exponential cutoff at the low end due to incompleteness - but I may be reading too far into your problem.
If that is the problem you're dealing with, this website (and accompanying publication) addresses the issue: http://tuvalu.santafe.edu/~aaronc/powerlaws/. I wrote the python implementation of the power-law fitter on that page; it is linked from there.

If you know that the points follow this law exactly, then invert the equation and put in an x and its corresponding f(x) value:
import math
# "lambda" is a reserved word in Python, so the cutoff rate is named lam here
alpha = -(lam*x + math.log(f(x)))/math.log(x)
But if the points do not exactly fit the equation, you will need to do some sort of regression to determine alpha.
EDIT: Ok, so they don't fit exactly. This is getting beyond a Python question, but there may be something in numpy that can handle it. Here is a numpy linear regression recipe but your equation can't be rearranged into a linear form, so you'll have to look into non-linear regression.
