Chi-square value from xarray polyfit - python

How can we derive the chi-square values of the polynomial fit using xarray.Dataset.polyfit? From what I can see in the documentation, it returns the residuals and the covariance matrix.
For example, I can calculate the 2nd-order polynomial fit across dim1 of my DataArray:
fit_param = ds.polyfit(dim='dim1', deg=2, cov=True)
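One manual route I can think of (a sketch only, with a synthetic variable y along dim1 and an assumed per-point uncertainty sigma, none of which come from my real data) is to evaluate the fitted polynomial with xr.polyval and sum the weighted squared residuals myself:
import numpy as np
import xarray as xr
# Synthetic example dataset with one variable 'y' along 'dim1'
dim1 = np.linspace(0, 1, 50)
ds = xr.Dataset({"y": ("dim1", 3 * dim1**2 + np.random.normal(0, 0.1, dim1.size))},
                coords={"dim1": dim1})
fit_param = ds.polyfit(dim='dim1', deg=2, cov=True)
# Evaluate the fitted polynomial at the original coordinate values
y_fit = xr.polyval(ds["dim1"], fit_param["y_polyfit_coefficients"])
# Chi-square: weighted sum of squared residuals (sigma = assumed measurement uncertainty)
sigma = 0.1
chi_sq = ((ds["y"] - y_fit) ** 2 / sigma**2).sum(dim="dim1")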

Related

Variance of each component after FastICA

After performing independent component analysis through FastICA, how can I calculate the variance captured by individual components (or by all components)?
For PCA it is straightforward: the variance explained by the components equals the eigenvalues of the covariance matrix of X. But for ICA, how should I proceed?
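To make the PCA part concrete, and to show one heuristic I have come across for ICA (back-projecting each component through the mixing matrix and measuring the variance of its reconstruction), here is a rough sketch; the heuristic and all names in it are illustrative assumptions rather than an established definition:
import numpy as np
from sklearn.decomposition import PCA, FastICA
X = np.random.rand(200, 5)                # placeholder data matrix
# PCA: explained variance equals the eigenvalues of the covariance matrix of X
pca = PCA(n_components=3).fit(X)
print(pca.explained_variance_)
# ICA heuristic: reconstruct X from each component alone and measure its variance
ica = FastICA(n_components=3, random_state=0)
S = ica.fit_transform(X)                  # sources, shape (n_samples, n_components)
A = ica.mixing_                           # mixing matrix, shape (n_features, n_components)
for i in range(S.shape[1]):
    contribution = np.outer(S[:, i], A[:, i])
    print(i, contribution.var())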

How to apply chi-square distance on a categorical dataset?

I want to apply chi-square distance on a categorical dataset (219 x 55).
As I understand, categorical data must be encoded first before applying the chi-square formula (reference, P.10).
The formula for the chi-square distance between two rows $i$ and $i'$ is as follows:
$$d_{\chi^2}(i, i') = \sqrt{\sum_{j} \frac{1}{x_{+j}} \left( \frac{x_{ij}}{x_{i+}} - \frac{x_{i'j}}{x_{i'+}} \right)^2},$$
where the row totals are denoted $x_{i+}$ and the column totals are $x_{+j}$.
I am struggling to understand what sort of output I will get from applying this formula to my dataset. Is it a matrix of distances between rows that is symmetric across the diagonal (similar to the one found in the reference)?
Or is it a matrix with the same shape as my dataset, where each value is replaced by a distance?
Finally, is there a method for chi-square distance in python?
I couldn't find a Python package implementing the $\chi^2$ distance, but the TraMineR package in R implements it (via the seqdist function with method = "CHI2"). That function takes an m x n matrix and returns an m x m matrix that is symmetric across the diagonal:
library(TraMineR)
data(biofam)
biofam.seq <- seqdef(biofam[501:600, 10:25])
dim(biofam.seq)
# [1] 100 16
biofam.chi.full <- seqdist(biofam.seq, method = "CHI2",
                           step = max(seqlength(biofam.seq)))
dim(biofam.chi.full)
# [1] 100 100
isSymmetric(biofam.chi.full)
# [1] TRUE
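If you would rather compute it in Python than go through R, a small NumPy sketch of the row-profile chi-square distance defined above (assuming the categorical table has already been encoded into non-negative counts or indicators) could look like this:
import numpy as np
def chi2_distance_matrix(X):
    # Pairwise chi-square distances between the rows of a non-negative matrix X
    X = np.asarray(X, dtype=float)
    row_tot = X.sum(axis=1, keepdims=True)   # row totals x_{i+}
    col_tot = X.sum(axis=0)                  # column totals x_{+j}
    profiles = X / row_tot                   # row profiles x_{ij} / x_{i+}
    diff = profiles[:, None, :] - profiles[None, :, :]
    return np.sqrt((diff ** 2 / col_tot).sum(axis=-1))
# An m x n table gives an m x m distance matrix that is symmetric across the diagonal
D = chi2_distance_matrix(np.random.randint(0, 5, size=(219, 55)))
print(D.shape, np.allclose(D, D.T))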

Multivariate Normal Distribution fitting dataset

I was reading a few papers about RNN networks. At some point, I came across the following explanations:
The prediction model trained on sN is used to compute the error vectors for each point in the validation and test sequences. The error vectors are modelled to fit a multivariate Gaussian distribution N = N(μ, Σ). The likelihood p(t) of observing an error vector e(t) is given by the value of N at e(t) (similar to normalized innovations squared (NIS) used for novelty detection using Kalman filter based dynamic prediction model [5]). The error vectors for the points from vN1 are used to estimate the parameters μ and Σ using Maximum Likelihood Estimation.
And:
A Multivariate Gaussian Distribution is fitted to the error vectors on the validation set. y(t) is the probability of an error vector e(t) after applying Multivariate Gaussian Distribution N = N(μ, Σ). Maximum Likelihood Estimation is used to select the parameters μ and Σ for the points from vN.
vN or vN1 are validation datasets; sN is the training dataset.
They are from two different articles, but they describe the same thing. I didn't really understand what they mean by fitting a Multivariate Gaussian Distribution to the data. What does it mean?
Many thanks,
Guillaume
Let's start with one-dimensional data first. If you have data distributed along a 1D line, it has a mean (µ) and a standard deviation (sigma). Modelling it is then as simple as using (µ, sigma) to generate new data points that follow your original distribution.
# Generating a new_point in a 1D Gaussian distribution
import random
mu, sigma = 1, 1.6
new_point = random.gauss(mu, sigma)
# 2.797757476598497
Now, in N-dimensional space, the multivariate normal distribution is a generalization of the one-dimensional case. The objective is to find a mean vector µ of length N and an N x N covariance matrix, this time denoted Σ, that model all the data points in the N-dimensional space. Having them, you are able to generate as many random data points as you want following the original distribution. In Python/NumPy, you can do it like this:
import numpy as np
mean, covariance = np.zeros(3), np.eye(3)  # placeholders; in practice, estimate these from your data
new_data_point = np.random.multivariate_normal(mean, covariance, 1)
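To tie this back to the quoted papers, here is a sketch with made-up error vectors: for a multivariate Gaussian, the maximum likelihood estimates are simply the sample mean and sample covariance, and the likelihood of a new error vector is the fitted density evaluated at that point.
import numpy as np
from scipy.stats import multivariate_normal
# Made-up error vectors from a validation set, shape (n_samples, n_dims)
errors = np.random.randn(500, 3)
# Maximum likelihood estimates of the Gaussian parameters
mu = errors.mean(axis=0)
sigma = np.cov(errors, rowvar=False)       # sample covariance (MLE up to an n/(n-1) factor)
# Likelihood p(t) of observing a new error vector e(t)
e_t = np.array([0.1, -0.2, 0.3])
p_t = multivariate_normal(mean=mu, cov=sigma).pdf(e_t)
print(p_t)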

Covariance matrix for circular variables?

In my current project, I have a collection of three-dimensional samples such as [-0.5,-0.1,0.2]*pi, [0.8,-0.1,-0.4]*pi. These variables are circular/periodic, with their values ranging from -pi to +pi. It is my goal to calculate a 3-by-3 covariance matrix for these circular variables.
SciPy has a built-in function to calculate circular standard deviations (scipy.stats.circstd), which I can use to calculate the standard deviations along each dimension and then use them to create a diagonal covariance matrix (i.e., without any correlation). Ideally, however, I would like to consider correlations between the parameters as well. Is there a way to calculate correlations between circular variables, or to directly compute the covariance matrix between them?
import numpy as np
import scipy.stats
# A collection of N circular samples
samples = np.asarray(
    [[0.384917, 1.28862, -2.034],
     [0.384917, 1.28862, -2.034],
     [0.759245, 1.16033, -2.57942],
     [0.45797, 1.31103, 2.9846],
     [0.898047, 1.20955, -3.02987],
     [1.25694, 1.74957, 2.46946],
     [1.02173, 1.26477, 1.83757],
     [1.22435, 1.62939, 1.99264]])
# Calculate the circular standard deviations along each dimension
stds = scipy.stats.circstd(samples, high=np.pi, low=-np.pi, axis=0)
# Create a diagonal covariance matrix (no correlations)
cov = np.identity(3)
np.fill_diagonal(cov, stds**2)
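One idea I am considering, continuing from the snippet above (I am not sure it is the right notion of correlation for circular data, so treat this as a sketch): compute the Jammalamadaka-SenGupta circular correlation coefficient for each pair of dimensions and scale it by the circular standard deviations.
import numpy as np
import scipy.stats
def circ_corr(a, b):
    # Jammalamadaka-SenGupta circular correlation between two arrays of angles
    a_bar = scipy.stats.circmean(a, high=np.pi, low=-np.pi)
    b_bar = scipy.stats.circmean(b, high=np.pi, low=-np.pi)
    sa, sb = np.sin(a - a_bar), np.sin(b - b_bar)
    return np.sum(sa * sb) / np.sqrt(np.sum(sa ** 2) * np.sum(sb ** 2))
# Pairwise circular correlations, scaled by the circular standard deviations from above
corr = np.array([[circ_corr(samples[:, i], samples[:, j]) for j in range(3)]
                 for i in range(3)])
cov_full = corr * np.outer(stds, stds)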

How to obtain the chi squared value as an output of scipy.optimize.curve_fit?

Is it possible to obtain the value of the chi squared as a direct output of scipy.optimize.curve_fit()?
Usually, it is easy to compute after the fit by squaring the differences between the model and the data, weighting by the uncertainties, and summing everything up. However, it is not as direct when the parameter sigma is given a 2D matrix (the covariance matrix of the data) instead of a simple 1D array.
Are the best-fit parameters and their covariance matrix really the only two outputs that can be extracted from curve_fit()?
It is not possible to obtain the value of chi^2 from scipy.optimize.curve_fit directly without manual calculations. It is possible to get additional output from curve_fit besides popt and pcov by providing the argument full_output=True, but the additional output does not contain the value of chi^2. (The additional output is documented e.g. at leastsq here).
In the case where sigma is a MxM array, the definition of the chi^2 function minimized by curve_fit is slightly different.
In this case, curve_fit minimizes the function r.T @ inv(sigma) @ r, where r = ydata - f(xdata, *popt), instead of chisq = sum((r / sigma) ** 2) as in the case of a one-dimensional sigma; see the documentation of the parameter sigma.
So you should also be able to calculate chi^2 in your case by using r.T @ inv(sigma) @ r with your optimized parameters.
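For completeness, a self-contained sketch (with toy data and a diagonal data covariance matrix purely for illustration) of the manual chi-square computation for a 2D sigma:
import numpy as np
from scipy.optimize import curve_fit
def f(x, a, b):
    return a * x + b
# Toy data with a full (here diagonal) covariance matrix for ydata
xdata = np.linspace(0, 10, 20)
ydata = 2.0 * xdata + 1.0 + np.random.normal(0, 0.5, xdata.size)
sigma = 0.5 ** 2 * np.eye(xdata.size)      # MxM covariance matrix of the data
popt, pcov = curve_fit(f, xdata, ydata, sigma=sigma, absolute_sigma=True)
# Chi-square with a 2D sigma: r.T @ inv(sigma) @ r
r = ydata - f(xdata, *popt)
chisq = r.T @ np.linalg.inv(sigma) @ r
print(chisq)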
An alternative would be to use another package, for example lmfit, where the value of chi square can be directly obtained from the fit result:
import numpy as np
from lmfit.models import GaussianModel
# example data: a histogram of Gaussian samples (n = bin counts, centers = bin centers)
n, edges = np.histogram(np.random.normal(9, 1, 1000), bins=50)
centers = 0.5 * (edges[:-1] + edges[1:])
model = GaussianModel()
# create parameters with initial guesses:
params = model.make_params(center=9, amplitude=40, sigma=1)
result = model.fit(n, params, x=centers)
print(result.chisqr)
