I have two random variables and I need to calculate some of their characteristics precisely.
https://math.stackexchange.com/questions/3052308/calculated-covariance-corr-coefficient-confirmation?noredirect=1#
I already did this in Java but I want to confirm my answers with at least one more tool.
Could anyone good at Python / probability give me some guidance on how to calculate these 6 values in Python? I guess it is really simple, but I am not very confident in Python.
I looked at the documentation of the numpy cov function, but I have difficulty understanding it.
The best solution is to use the functions from numpy:
import numpy as np

# Expected values: probability-weighted means of the outcomes
e_X = np.average(X_values, weights=X_weights)
e_Y = np.average(Y_values, weights=Y_weights)
# Variances: probability-weighted mean squared deviations from the means
varX = np.average((X_values - e_X)**2, weights=X_weights)
varY = np.average((Y_values - e_Y)**2, weights=Y_weights)
# np.cov / np.corrcoef return 2x2 matrices (the [0, 1] entry is the value you
# want); note they treat the values as equally weighted paired samples
cov_XY = np.cov(X_values, Y_values)
corrcoef_XY = np.corrcoef(X_values, Y_values)
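If the weights are genuine probabilities rather than repeated observations, note that np.cov and np.corrcoef above do not use them. Below is a minimal sketch with made-up numbers (not from the linked question), assuming X and Y are paired outcomes sharing one set of joint probabilities, that computes all six quantities directly from their definitions:
import numpy as np

# Hypothetical paired outcomes and joint probabilities (illustrative only)
X_values = np.array([1.0, 2.0, 3.0])
Y_values = np.array([2.0, 1.0, 4.0])
weights = np.array([0.2, 0.5, 0.3])           # joint probabilities, sum to 1

e_X = np.average(X_values, weights=weights)
e_Y = np.average(Y_values, weights=weights)
var_X = np.average((X_values - e_X) ** 2, weights=weights)
var_Y = np.average((Y_values - e_Y) ** 2, weights=weights)
cov_XY = np.average((X_values - e_X) * (Y_values - e_Y), weights=weights)
corr_XY = cov_XY / np.sqrt(var_X * var_Y)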
Does anyone have any idea how to efficiently implement a 2D probability Jaccard similarity algorithm in numpy? It looks like this specific algorithm is almost non-existent in computer vision libraries (not in pytorch, not in tensorflow, nor in sklearn; I wonder whether there is a specific reason for this). The formula for probability Jaccard similarity is (taken from Wikipedia):
J_P(x, y) = sum_{i : x_i != 0, y_i != 0} 1 / (sum_j max(x_j / x_i, y_j / y_i))
This is one way of doing it. It's pretty straightforward: we use broadcasting to perform the divisions for all pairs of entries without loops:
import numpy as np

def jaccard_probability(x, y):
    # Ignore zero entries (assumes x and y are nonzero at the same indices)
    x0 = x[x != 0]
    y0 = y[y != 0]
    # Broadcasting builds the full matrices of ratios x_j / x_i and y_j / y_i
    jac = np.sum(
        1.0 / np.sum(np.maximum(x0[:, None] / x0, y0[:, None] / y0), axis=0)
    )
    return jac
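For illustration, a hypothetical call of the function above with two small distributions that are nonzero at the same indices (an assumption the implementation relies on):
x = np.array([0.1, 0.4, 0.5])
y = np.array([0.2, 0.2, 0.6])
print(jaccard_probability(x, y))   # close to 1 for similar distributions, exactly 1 when x == y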
However, I suggest you read the NumPy guide to get a grasp of the basics, at least of broadcasting, as it is a very useful tool to know if you plan on using NumPy in the future and want to write efficient code!
I am using a fit function to calculate values used by an application in a manner similar to below:
import numpy as np
from numpy import random
x = range(10)
y = random.standard_normal(10)
w = random.standard_normal(10)/10
w = 1/w
p,cov = np.polynomial.polynomial.polyfit(x=x,y=y,deg=1,w=w,full=True)
fun = np.polynomial.polynomial.Polynomial(p)
new_x = 20
new_y = fun(new_x)
#y_1_sigma_uncertainty = ???
Is there a way to use the covariance matrix to calculate an uncertainty associated with values calculated by fun? Is there another way to go about this? I have done quite a bit of searching, but I am probably not asking the question correctly. I am not a stats person so I am hoping my example is useful in clarifying what I am trying to ask.
Thanks,
gl
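One thing worth noting: with full=True, np.polynomial.polynomial.polyfit returns diagnostic information (residuals, rank, singular values, rcond), not a covariance matrix; the legacy np.polyfit can return a coefficient covariance via cov=True. A minimal sketch (not from the original post) of one common approach, propagating that covariance linearly to a 1-sigma uncertainty on a predicted value:
import numpy as np

# Made-up data and per-point uncertainties, purely illustrative
rng = np.random.default_rng(0)
x = np.arange(10.0)
y = rng.standard_normal(10)
sigma_y = np.abs(rng.standard_normal(10) / 10) + 0.05

# Legacy np.polyfit takes w = 1/sigma and can return the coefficient covariance
p, C = np.polyfit(x, y, deg=1, w=1.0 / sigma_y, cov=True)

new_x = 20.0
new_y = np.polyval(p, new_x)

# Linear error propagation: var(f) = v @ C @ v with v = d(f)/d(coefficients).
# For deg=1 and np.polyfit's highest-power-first ordering, v = [new_x, 1].
v = np.array([new_x, 1.0])
y_1_sigma_uncertainty = np.sqrt(v @ C @ v)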
SciPy's stats module has objects of the type "random variable" (they call it rv_frozen). It makes it easy to plot, say, cdf's of random variables of a given distribution. Here's a very simple example:
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats

n = stats.norm()          # frozen standard normal random variable
x = np.linspace(-3, 3)
y = n.cdf(x)
plt.plot(x, y)
plt.show()
I wondered whether there's a way of doing basic arithmetic manipulations on such random variables. The following example is wishful thinking (it doesn't work).
du_list = [stats.randint(2, 5) for _ in range(100)]
du_avg = sum(du_list) / len(du_list)   # fails: frozen rv objects don't support arithmetic
x = np.linspace(0, 10)
y = du_avg.cdf(x)
plt.plot(x, y)
This wishful-thinking example should produce the graph of the cumulative distribution function of the random variable which is the average of 100 i.i.d. random variables, each distributed uniformly on the set {2,3,4}.
I realize this is a bit late, but I figured I'd answer in case anyone else needs this in the future. I needed the same functionality recently and even thought about extending scipy's rv_discrete to implement this, but then I found PaCAL.
PaCAL is a Python software package for doing arithmetic on random variables and it supports quite a few distributions, including continuous distributions. There is even some support for bivariate joint distributions. Available as a package on PyPI. Only for Python 2.x though.
EDIT: The PaCAL repo on Github now supports Python 3.x as well.
The method that matches your description exactly doesn't exist. The cdf's of the different distributions are all defined in the scipy/stats/distributions.py source file. For example:
Boltzmann distribution CDF (Line 7675):
def _cdf(self, x, lambda_, N):
    k = floor(x)
    return (1 - exp(-lambda_*(k+1))) / (1 - exp(-lambda_*N))
You can, however, estimate the parameters from the data and then call the cdf method; see this sample:
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as ss

unknown = np.random.normal(loc=1.1, scale=2.0, size=100)
Loc, Scale = ss.norm.fit_loc_scale(unknown)  # estimate loc and scale from the data
unknown_cdf = lambda x: ss.norm.cdf(x, loc=Loc, scale=Scale)  # cdf of the fitted distribution
plt.plot(np.linspace(-10, 10), unknown_cdf(np.linspace(-10, 10)), '-')
plt.show()
You could compute it by hand.
With X the random variable created as the sum (or average) of the X_i, i.i.d. random variables each distributed uniformly on {2, 3, 4}: sample from X to estimate the pdf, and cumulatively sum (integrate) to obtain the cdf.
Or you could try to find an analytical solution for this problem.
See the Irwin-Hall distribution
and the related discussion on Math-stackexchange.
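A minimal Monte Carlo sketch of the by-hand approach (sample sizes are made up, not part of the original answer), estimating the empirical cdf of the average of 100 i.i.d. stats.randint(2, 5) variables:
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats

# Draw 100000 realisations of the average of 100 i.i.d. uniform {2, 3, 4} variables
samples = stats.randint(2, 5).rvs(size=(100000, 100)).mean(axis=1)

# Empirical cdf: fraction of realisations at or below each grid point
grid = np.linspace(2.5, 3.5, 200)
ecdf = np.searchsorted(np.sort(samples), grid, side='right') / samples.size
plt.plot(grid, ecdf)
plt.show()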
Let's assume I have some data I obtained empirically:
import numpy as np
from scipy import stats

size = 10000
x = 10 * stats.expon.rvs(size=size) + 0.2 * np.random.uniform(size=size)
It is exponentially distributed (with some noise) and I want to verify this using a chi-squared goodness of fit (GoF) test. What is the simplest way of doing this using the standard scientific libraries in Python (e.g. scipy or statsmodels) with the least amount of manual steps and assumptions?
I can fit a model with:
import matplotlib.pyplot as plt

param = stats.expon.fit(x)
grid = np.linspace(0, 100, 10000)
plt.hist(x, density=True, color='white', hatch='/')
plt.plot(grid, stats.expon.pdf(grid, *param))
The Kolmogorov-Smirnov test can be calculated very elegantly:
>>> stats.kstest(x, lambda x : stats.expon.cdf(x, *param))
(0.0061000000000000004, 0.85077099515985011)
However, I can't find a good way of calculating the chi-squared test.
There is a chi-squared GoF function in statsmodels, but it assumes a discrete distribution (and the exponential distribution is continuous).
The official scipy.stats tutorial only covers a case for a custom distribution and probabilities are built by fiddling with many expressions (npoints, npointsh, nbound, normbound), so it's not quite clear to me how to do it for other distributions. The chisquare examples assume the expected values and DoF are already obtained.
Also, I am not looking for a way to "manually" perform the test as was already discussed here, but would like to know how to apply one of the available library functions.
An approximate solution for equal-probability bins (a short sketch follows these steps):
Estimate the parameters of the distribution.
Use the inverse cdf, ppf if it's a scipy.stats distribution, to get the bin edges for a regular probability grid, e.g. distribution.ppf(np.linspace(0, 1, n_bins + 1), *args).
Then use np.histogram to count the number of observations in each bin.
Finally, use the chisquare test on the observed frequencies.
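A minimal sketch of these steps (the number of bins is made up, and the data generation just mirrors the question), assuming an exponential fit:
import numpy as np
from scipy import stats

x = 10 * stats.expon.rvs(size=10000)                 # stand-in for the empirical data

n_bins = 20
param = stats.expon.fit(x)                           # 1. estimate the parameters
# 2. equal-probability bin edges from the inverse cdf (last edge is +inf)
edges = stats.expon.ppf(np.linspace(0, 1, n_bins + 1), *param)
observed, _ = np.histogram(x, bins=edges)            # 3. observed counts per bin
expected = np.full(n_bins, len(x) / n_bins)          # equal expected counts by construction
# 4. chi-squared test; ddof accounts for the two fitted parameters
stat, p_value = stats.chisquare(observed, expected, ddof=2)
print(stat, p_value)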
An alternative would be to find the bin edges from the percentiles of the sorted data, and use the cdf to find the actual probabilities.
This is only approximate, since the theory for the chisquare test assumes that the parameters are estimated by maximum likelihood on the binned data. And I'm not sure whether the selection of binedges based on the data affects the asymptotic distribution.
I haven't looked into this in a long time.
If an approximate solution is not good enough, then I would recommend that you ask the question on stats.stackexchange.
Why do you need to "verify" that it's exponential? Are you sure you need a statistical test? I can pretty much guarantee that it isn't ultimately exponential, and the test would be significant if you had enough data, making the logic of using the test rather forced. It may help you to read this CV thread: Is normality testing 'essentially useless'?, or my answer here: Testing for heteroscedasticity with many observations.
It is typically better to use a qq-plot and/or pp-plot (depending on whether you are concerned about the fit in the tails or middle of the distribution, see my answer here: PP-plots vs. QQ-plots). Information on how to make qq-plots in Python SciPy can be found in this SO thread: Quantile-Quantile plot using SciPy
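For example, a short sketch of a qq-plot against a fitted exponential using scipy.stats.probplot (the data generation mirrors the question; the rest is illustrative):
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

size = 10000
x = 10 * stats.expon.rvs(size=size) + 0.2 * np.random.uniform(size=size)
param = stats.expon.fit(x)

# QQ-plot of the sample against the fitted exponential; sparams passes (loc, scale)
stats.probplot(x, sparams=param, dist=stats.expon, plot=plt)
plt.show()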
I tried your problem with OpenTURNS.
The beginning is the same:
import numpy as np
from scipy import stats
size = 10000
x = 10 * stats.expon.rvs(size=size) + 0.2 * np.random.uniform(size=size)
If you suspect that your sample x is coming from an Exponential distribution, you can use ot.ExponentialFactory() to fit the parameters:
import openturns as ot
sample = ot.Sample([[p] for p in x])
distribution = ot.ExponentialFactory().build(sample)
As the factory needs an ot.Sample as input, I needed to format x and reshape it as 10,000 points of dimension 1.
Let's now assess this fit using the ChiSquared test:
result = ot.FittingTest.ChiSquared(sample, distribution, 0.01)
print('Exponential?', result.getBinaryQualityMeasure(), ', P-value=', result.getPValue())
>>> Exponential? True , P-value= 0.9275212544642293
Very good!
And of course, print(distribution) will give you the fitted parameters:
>>> Exponential(lambda = 0.0982391, gamma = 0.0274607)
I found this chunk of code on http://rosettacode.org/wiki/Multiple_regression#Python, which does a multiple linear regression in Python. Printing b in the following code gives you the coefficients of x1, ..., xN. However, this code fits the line through the origin (i.e. the resulting model does not include a constant).
All I'd like to do is the exact same thing except I do not want to fit the line through the origin, I need the constant in my resulting model.
Any idea if it's a small modification to do this? I've searched and found numerous documents on multiple regressions in Python, except they are lengthy and overly complicated for what I need. This code works perfectly, except I just need a model that includes an intercept rather than passing through the origin.
import numpy as np
from numpy.random import random
n=100
k=10
y = np.mat(random((1,n)))
X = np.mat(random((k,n)))
b = y * X.T * np.linalg.inv(X*X.T)
print(b)
Any help would be appreciated. Thanks.
You only need to add a row of all 1s to X.
Maybe a more stable approach would be to use a least squares algorithm anyway. This can also be done in numpy in a few lines. Read the documentation about numpy.linalg.lstsq.
Here you can find an example implementation:
http://glowingpython.blogspot.de/2012/03/linear-regression-with-numpy.html
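A short sketch of that lstsq-based approach, using made-up data with the same shapes as in the question (X is (k, n), y has n observations):
import numpy as np

n, k = 100, 10
rng = np.random.default_rng(0)
X = rng.random((k, n))
y = rng.random(n)

# Stack a row of ones so the last coefficient becomes the intercept, then
# transpose to the (n_samples, n_features) layout expected by lstsq.
A = np.vstack([X, np.ones(n)]).T
coef, residuals, rank, sv = np.linalg.lstsq(A, y, rcond=None)
slopes, intercept = coef[:-1], coef[-1]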
What you have written out, b = y * X.T * np.linalg.inv(X * X.T), is the solution to the normal equations, which gives the least squares fit with a multi-linear model. swang's response is correct (as is EMS's elaboration): you need to add a row of 1's to X. If you want some idea of why it works theoretically, keep in mind that you are finding b_i such that
y_j = sum_i b_i x_{ij}.
By adding a row of 1's, you are setting x_{(k+1)j} = 1 for all j, which means that you are finding b_i such that:
y_j = (sum_i b_i x_{ij}) + b_{k+1}
because the x_{(k+1)j} term is always equal to one. Thus, b_{k+1} is your intercept term.
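To make that concrete, here is a minimal sketch of the row-of-ones modification applied to the snippet from the question (plain arrays and @ are used instead of np.mat for clarity):
import numpy as np

n, k = 100, 10
rng = np.random.default_rng(0)
y = rng.random((1, n))
X = rng.random((k, n))

# Append a row of ones; the extra (last) coefficient is the intercept
X1 = np.vstack([X, np.ones((1, n))])
b = y @ X1.T @ np.linalg.inv(X1 @ X1.T)   # shape (1, k+1)
slopes, intercept = b[0, :k], b[0, k]
print(slopes, intercept)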