Fitting a curve: which model describes the distribution in weighted knowledge graphs? - Python

As a simple model to represent a knowledge network and learn about properties of weighted graphs, I computed the cosine similarity between Wikipedia articles.
I am now looking at the distribution of the similarity weights for each article (see the pictures).
In the pictures you can see that the curve changes its derivative around a certain value (perhaps from exponential to linear). I would like to fit the curve and extract the value where the derivative visibly (or expectedly) changes, so that I can divide the similar articles into two sets: the "most similar" (left of the threshold) and the "others" (right of the threshold).
I want to fit the curve for each article's distribution, compare each distribution with the mean distribution over all articles, and compare it with the distribution of a random weighted network.
(Your suggestions on defining a working procedure are most welcome: I would like to use this as a toy model to then study how a network, or an article, may evolve in time.)
My background is in User Experience with a twist of data science. I would like to better understand which model may describe the distribution of values I observed, a proper way to compare distributions, and Python (or Mathematica 11) tools to fit the curve and obtain the derivative at each point.
Which model would you suggest to describe the distribution of observed similarity values between objects in a weighted network (here, a collaborative knowledge base is represented as a weighted network, where the weight is the similarity of two given articles)? Should I expect an exponential? A Poissonian? Why?
How do I compute a curve fit and extract the derivative of the curve at a given point (Python or Mathematica 11)?

Working with Mathematica, suppose your data is in the list data. Then if you want to find the cubic polynomial that best fits your data, use the Fit function:
Fit[data, {1, x, x^2, x^3}, x]
In general the usage for the Fit command looks like
Fit["data set", "list of functions", "independent variable"]
where Mathematica tries to fit a linear combination of the functions in that list to your data set. I'm not sure what sort of curve we should expect this data to be best modeled by, but remember that any smooth function can be approximated to arbitrary precision on a bounded interval by a polynomial with sufficiently many terms. So if you have the computational power to spare, just let your list of functions be a long list of powers of x. It does look like you have an asymptote at x = 0, though, so you might also include a 1/x term to capture that. And then of course you can use Plot to plot your curve on top of your data to compare them visually.
Now to get this best fit curve as a function in Mathematica that you can take a derivative of:
f[x_] = Fit[data, {1, x, x^2, x^3}, x]
(Note the use of = rather than := here, so that Fit is evaluated once and f stores the resulting polynomial in x, instead of re-running the fit with a numerical value substituted for x on every call.)
And then the obvious change you are talking about occurs when the second derivative is zero, so to get that x value:
NSolve[f''[x] == 0, x]
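For the Python side of the question, a rough counterpart (this is a sketch with placeholder data, not part of the original answer) is to fit a polynomial with numpy.polyfit and look for the roots of its second derivative:
import numpy as np

# Placeholder data standing in for one article's similarity curve
x = np.linspace(0.01, 1.0, 200)
y = np.exp(-5 * x) + 0.05 * np.random.default_rng(0).normal(size=x.size)

coeffs = np.polyfit(x, y, deg=3)   # cubic least-squares fit
p = np.poly1d(coeffs)
p2 = p.deriv(2)                    # second derivative of the fitted polynomial

inflection = p2.roots              # where the curvature of the fit is zero
print(inflection[np.isreal(inflection)].real)
As in the Mathematica version, p can be evaluated at any point, and p.deriv(1) gives the first derivative if you want the slope at a given x.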

Related

Python: How to discretize continuous probability distributions for Kullback-Leibler Divergence

I want to find out the minimum number of samples needed to more or less correctly fit a probability distribution (in my case the Generalized Extreme Value distribution from scipy.stats).
In order to evaluate the fitted function, I want to compute the KL divergence between the original function and the fitted one.
Unfortunately, all implementations I found (e.g. scipy.stats.entropy) only take discrete arrays as input. So obviously I thought of approximating the pdf by a discrete array, but I just can't seem to figure it out.
Does anyone have experience with something similar? I would be thankful for hints relating directly to my question, but also for better alternatives to determine a distance between two functions in python, if there are any.
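One way this is often approached (a hedged sketch, not an answer from the original thread; the GEV parameters and grid size below are arbitrary) is to evaluate both pdfs on a common grid and hand the resulting arrays to scipy.stats.entropy:
import numpy as np
from scipy import stats

true_dist = stats.genextreme(c=-0.1)                 # "original" GEV, parameters arbitrary
samples = true_dist.rvs(size=500, random_state=0)
fitted_dist = stats.genextreme(*stats.genextreme.fit(samples))

# Discretize both pdfs on a common grid
grid = np.linspace(samples.min(), samples.max(), 1000)
p = true_dist.pdf(grid)
q = fitted_dist.pdf(grid) + 1e-12                    # small floor so the KL sum stays finite

kl = stats.entropy(p, q)                             # entropy() normalizes and computes KL(p || q)
print(kl)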

How to separate two distributions from a pdf (probability density function)?

Assume the pdf (probability density function) of my dataset is as below. There are two distributions in the dataset, and I want to fit them with a Gamma and a Gaussian distribution.
I have read this, but it applies only to two Gaussian distributions:
How to use python to separate two gaussian curves?
Here are the steps I would like to follow:
1. manually estimate the mean and variance of the Gaussian distribution
2. based on the estimated mean and variance, create the pdf of the Gaussian distribution
3. subtract the pdf of the Gaussian from the original pdf
4. fit the remaining pdf to a Gamma distribution
I am able to do steps 1-3, but for step 4 I do not know how to fit a Gamma distribution from a pdf
(not from data samples, so stats.gamma.fit(data) does not work here).
Are the above steps reasonable for dealing with this kind of problem, and how can I do step 4 in Python? Any help is appreciated.
Interesting question. One issue I can see is that it will sometimes be difficult to disambiguate which mode is the Gamma and which is the Gaussian.
What I would perhaps do is try an expectation-maximization (EM) algorithm. Given the ambiguity stated above, I would do a few runs and select the best fit.
Perhaps, to speed things up a little, you could try EM with two possible starting points (somewhat similar to your idea):
estimate a mixture model of two Gaussians, say g0, g1.
run one EM to solve a mixture (Gamma, Gaussian), starting with an initial point that is (gamma_params_approx_gauss(g0), g1), where gamma_params_approx_gauss(g0) is a maximum-likelihood estimator of Gamma parameters given Gaussian parameters (see e.g. here).
run another EM, but starting with (gamma_params_approx_gauss(g1), g0).
select the best fit among the two EM results.
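For step 4 of the question specifically (fitting a Gamma to a pdf curve rather than to samples), one possibility is a least-squares fit of gamma.pdf to the residual pdf with scipy.optimize.curve_fit. This is only a sketch with a made-up residual and the location parameter fixed at 0, not the EM procedure described above:
import numpy as np
from scipy import stats
from scipy.optimize import curve_fit

# Pretend this is the residual pdf left after subtracting the Gaussian component
grid = np.linspace(0.01, 20, 500)
residual_pdf = stats.gamma.pdf(grid, a=3.0, scale=1.5)   # toy "observed" residual

def gamma_pdf(x, a, scale):
    # loc is fixed at 0 for simplicity
    return stats.gamma.pdf(x, a, scale=scale)

params, _ = curve_fit(gamma_pdf, grid, residual_pdf, p0=[2.0, 1.0], bounds=(0, np.inf))
print(params)    # recovered shape and scale of the Gamma component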

Frontier Equation - Fit a polynomial to the top of a data set

How can I fit a polynomial to an empirical data set using Python such that it fits the "top" of the data, i.e. for every value of x the output of the function is greater than the largest y at that x, while at the same time minimizing this so that it hugs the data? An example of what I'm referring to is seen in the image below:
You need to use cvxopt to find the coordinates of the efficient frontier, which is a quadratic programming problem, then feed those coordinates into numpy's polyfit to get the polynomial fitting the frontier. This Quantopian blog post does both: https://blog.quantopian.com/markowitz-portfolio-optimization-2/
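If you only need the "stay above every point while hugging the data" property and not the Markowitz frontier itself, a different route (a sketch with toy data, not what the blog post does) is to pose the fit as a linear program with scipy.optimize.linprog, minimizing the total gap while constraining the polynomial to lie above all points:
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, 200))
y = np.sin(x) + rng.normal(0, 0.3, x.size)    # toy data, purely illustrative

degree = 4
V = np.vander(x, degree + 1)                  # columns are x^degree ... x^0

# Objective: minimize sum_i p(x_i) (equivalent to the total gap, since sum_i y_i is constant)
c = V.sum(axis=0)

# Constraints: p(x_i) >= y_i for every point, written as -V @ coeffs <= -y
res = linprog(c, A_ub=-V, b_ub=-y, bounds=(None, None))
p = np.poly1d(res.x)                          # highest-order coefficient first

print(np.min(p(x) - y))                       # >= 0 up to solver tolerance: the curve sits on top of the data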

What exactly is the variance on the parameters of SciPy curve fit? (Python)

I'm currently using the curve_fit function of the scipy.optimize package in Python, and I know that if you take the square root of the diagonal entries of the covariance matrix you get from curve_fit, you get the standard deviation of the parameters that curve_fit calculated. What I'm not sure about is what exactly this standard deviation means. It's an approximation using a Hessian matrix as far as I understand, but what would the exact calculation be? The standard deviation of a Gaussian bell curve tells you what percentage of the area is within a certain range of the curve, so I assumed that for curve_fit it tells you how many data points lie between certain parameter values, but apparently that isn't right...
I'm sorry if this should be basic knowledge for curve fitting, but I really can't figure out what the standard deviations mean. They express an error on the parameters, but those parameters are calculated as the best possible fit for the function; it's not as if there were a whole collection of optimal parameters whose average we take, with a standard deviation to go with it. There is only one optimal value, so what is there to compare it with? I guess my question really comes down to this: how can I manually and accurately calculate these standard deviations, rather than just getting an approximation via a Hessian matrix?
The variance in the fitted parameters represents the uncertainty in the best-fit value based on the quality of the fit of the model to the data. That is, it describes by how much the value could change away from the best-fit value and still have a fit that is almost as good as the best-fit value.
With standard definition of chi-square,
chi_square = ( ( (data - model)/epsilon )**2 ).sum()
and reduced_chi_square = chi_square / (ndata - nvarys) (where data is the array of data values, model the array of calculated model values, epsilon the uncertainty in the data, ndata the number of data points, and nvarys the number of variables), a good fit should have reduced_chi_square around 1, or chi_square around ndata - nvarys. (Note: not 0; the fit will not be perfect, as there is noise in the data.)
The variance in the best-fit value for a variable gives the amount by which you can change the value (and re-optimize all other values) and increase chi-square by 1. That gives the so-called '1-sigma' value of the uncertainty.
As you say, these values are expressed in the diagonal terms of the covariance matrix returned by scipy.optimize.curve_fit (the off-diagonal terms give the correlations between variables: if the value of one variable is changed away from its optimum, how would the others respond to keep the fit good). This covariance matrix is built from the trial values and derivatives near the solution as the fit is being done -- it measures the "curvature" of the parameter space (i.e., how much chi-square changes when a variable's value changes).
You can calculate these uncertainties by hand. The lmfit library (https://lmfit.github.io/lmfit-py/) has routines to more explicitly explore the confidence intervals of variables from least-squares minimization or curve-fitting. These are described in more detail at
https://lmfit.github.io/lmfit-py/confidence.html. It's probably easiest to use lmfit for the curve-fitting rather than trying to re-implement the confidence interval code for curve_fit.
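A small self-contained example of how these pieces relate (the model and data here are invented for illustration, and epsilon is assumed known):
import numpy as np
from scipy.optimize import curve_fit

def model(x, a, b):
    return a * np.exp(-b * x)

rng = np.random.default_rng(1)
x = np.linspace(0, 5, 50)
epsilon = 0.05                                     # uncertainty of each data point
ydata = model(x, 2.0, 1.3) + rng.normal(0, epsilon, x.size)

popt, pcov = curve_fit(model, x, ydata, sigma=np.full(x.size, epsilon), absolute_sigma=True)

perr = np.sqrt(np.diag(pcov))                      # 1-sigma uncertainties on a and b
chi_square = (((ydata - model(x, *popt)) / epsilon) ** 2).sum()
reduced_chi_square = chi_square / (x.size - len(popt))
print(popt, perr, reduced_chi_square)              # reduced chi-square should come out near 1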

Sampling methods

Can you help me out with these questions? I'm using Python
Sampling Methods
Sampling (or Monte Carlo) methods form a general and useful set of techniques that use random numbers to extract information about (multivariate) distributions and functions. In the context of statistical machine learning, we are most often concerned with drawing samples from distributions to obtain estimates of summary statistics such as the mean value of the distribution in question.
When we have access to a uniform (pseudo) random number generator on the unit interval (rand in Matlab or runif in R) then we can use the transformation sampling method described in Bishop Sec. 11.1.1 to draw samples from more complex distributions. Implement the transformation method for the exponential distribution
$$p(y) = \lambda \exp(-\lambda y), \quad y \geq 0$$
using the expressions given at the bottom of page 526 in Bishop. Slice sampling involves augmenting $z$ with an additional variable $u$ and then drawing samples from the joint $(z, u)$ space.
The crucial point of sampling methods is how many samples are needed to obtain a reliable estimate of the quantity of interest. Let us say we are interested in estimating the mean, which is
$$\mu_y = 1/\lambda$$
in the above distribution, we then use the sample mean
$$b_y = \frac{1}{L} \sum_{\ell=1}^{L} y^{(\ell)}$$
of the L samples as our estimator. Since we can generate as many samples of size L as we want, we can investigate how this estimate on average converges to the true mean. To do this properly we need to take the absolute difference
$$|\mu_y - b_y|$$
between the true mean $\mu_y$ and the estimate $b_y$,
averaged over many, say 1000, repetitions for several values of $L$, say 10, 100, 1000.
Plot the expected absolute deviation as a function of $L$.
Can you plot some transformed value of expected absolute deviation to get a more or less straight line and what does this mean?
I'm new to this kind of statistical machine learning and really don't know how to implement it in Python. Can you help me out?
There are a few shortcuts you can take. Python has good libraries for sampling, mainly SciPy. I can recommend a manuscript that implements this idea in Python (disclaimer: I am the author), located here.
It is part of a larger book, but this isolated chapter deals with the more general Law of Large Numbers + convergence, which is what you are describing. The paper deals with Poisson random variables, but you should be able to adapt the code to your own situation.
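As a hedged sketch of the exercise itself (lambda, the values of L, and the number of repetitions below follow the question; everything else is an implementation choice):
import numpy as np
import matplotlib.pyplot as plt

lam = 2.0                        # rate parameter lambda (arbitrary choice)
true_mean = 1.0 / lam
rng = np.random.default_rng(0)

def sample_exponential(size):
    # Transformation method: if z ~ Uniform(0, 1), then y = -ln(1 - z)/lam ~ Exp(lam)
    z = rng.uniform(size=size)
    return -np.log(1.0 - z) / lam

Ls = [10, 100, 1000]
repetitions = 1000
deviations = []
for L in Ls:
    abs_dev = [abs(true_mean - sample_exponential(L).mean()) for _ in range(repetitions)]
    deviations.append(np.mean(abs_dev))

# On a log-log plot the expected absolute deviation is roughly a straight line with
# slope -1/2, reflecting the 1/sqrt(L) convergence of the sample mean
plt.loglog(Ls, deviations, "o-")
plt.xlabel("L (samples per estimate)")
plt.ylabel("mean |mu_y - b_y|")
plt.show()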
