Say I fit a model in statsmodels
mod = smf.ols('dependent ~ first_category + second_category + other', data=df).fit()
When I do mod.summary() I may see the following:
Warnings:
[1] The condition number is large, 1.59e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
Sometimes the warning is different (e.g. based on the eigenvalues of the design matrix). How can I capture high-multicollinearity conditions in a variable? Is this warning stored somewhere in the model object?
Also, where can I find a description of the fields in summary()?
You can detect high multicollinearity by inspecting the eigenvalues of the correlation matrix. A very low eigenvalue shows that the data are collinear, and the corresponding eigenvector shows which variables are involved.
If there is no collinearity in the data, you would expect none of the eigenvalues to be close to zero:
>>> xs = np.random.randn(100, 5) # independent variables
>>> corr = np.corrcoef(xs, rowvar=0) # correlation matrix
>>> w, v = np.linalg.eig(corr) # eigen values & eigen vectors
>>> w
array([ 1.256 , 1.1937, 0.7273, 0.9516, 0.8714])
However, if, say, x[4] - 2 * x[0] - 3 * x[2] = 0 (up to noise), then
>>> noise = np.random.randn(100) # white noise
>>> xs[:,4] = 2 * xs[:,0] + 3 * xs[:,2] + .5 * noise # collinearity
>>> corr = np.corrcoef(xs, rowvar=0)
>>> w, v = np.linalg.eig(corr)
>>> w
array([ 0.0083, 1.9569, 1.1687, 0.8681, 0.9981])
One of the eigenvalues (here the very first one) is close to zero. The corresponding eigenvector is:
>>> v[:,0]
array([-0.4077, 0.0059, -0.5886, 0.0018, 0.6981])
Ignoring the near-zero coefficients, the above basically says that x[0], x[2] and x[4] are collinear (as expected). If one standardizes the xs values and multiplies by this eigenvector, the result hovers around zero with small variance:
>>> std_xs = (xs - xs.mean(axis=0)) / xs.std(axis=0) # standardized values
>>> ys = std_xs.dot(v[:,0])
>>> ys.mean(), ys.var()
(0, 0.0083)
Note that ys.var() is basically the eigenvalue that was close to zero.
So, in order to capture high multicollinearity, look at the eigenvalues of the correlation matrix.
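If you want this as a single reusable check (just a sketch, not part of the answer above; the 0.01 cut-off is an arbitrary choice), you can wrap it in a helper that returns the suspicious eigenvalue/eigenvector pairs. np.linalg.eigh is used here because the correlation matrix is symmetric:
import numpy as np

def collinearity_check(X, tol=0.01):
    """Return (eigenvalue, eigenvector) pairs of the correlation matrix of X
    whose eigenvalue falls below tol; each eigenvector points at the columns
    involved in a near-linear dependency."""
    corr = np.corrcoef(X, rowvar=0)
    w, v = np.linalg.eigh(corr)   # eigh: symmetric input, eigenvalues sorted ascending
    return [(w[i], v[:, i]) for i in range(len(w)) if w[i] < tol]

# e.g. collinearity_check(xs) on the example above flags the ~0.0083 eigenvalue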
Based on a similar question for R, there are some other options that may help. I was looking for a single number that captured the collinearity; options include the determinant and the condition number of the correlation matrix.
According to one of the R answers, the determinant of the correlation matrix will "range from 0 (Perfect Collinearity) to 1 (No Collinearity)". I found the bounded range helpful.
Translated example for determinant:
import numpy as np
import pandas as pd
# Create a sample random dataframe
np.random.seed(321)
x1 = np.random.rand(100)
x2 = np.random.rand(100)
x3 = np.random.rand(100)
df = pd.DataFrame({'x1': x1, 'x2': x2, 'x3': x3})
# Now create a dataframe with multicollinearity
multicollinear_df = df.copy()
multicollinear_df['x3'] = multicollinear_df['x1'] + multicollinear_df['x2']
# Compute both correlation matrices
corr = np.corrcoef(df, rowvar=0)
multicollinear_corr = np.corrcoef(multicollinear_df, rowvar=0)
# Compare the determinants
print(np.linalg.det(corr))                  # 0.988532159861
print(np.linalg.det(multicollinear_corr))   # 2.97779797328e-16
And similarly, the condition number of the correlation matrix will approach infinity with perfect linear dependence.
print(np.linalg.cond(corr))                  # 1.23116253259
print(np.linalg.cond(multicollinear_corr))   # 6.19985218873e+15
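To tie this back to the original statsmodels question: the design matrix the model was fitted on is available as mod.model.exog, so these diagnostics can be captured in a variable directly from the fitted model. A sketch (assuming the formula puts the intercept in the first column, as patsy normally does):
import numpy as np

X = mod.model.exog                         # design matrix built from the formula
cond_number = np.linalg.cond(X)            # essentially the number quoted in the summary() warning
corr = np.corrcoef(X[:, 1:], rowvar=0)     # drop the intercept column before correlating
corr_det = np.linalg.det(corr)             # near 0 -> strong collinearity
Depending on your statsmodels version, the results object may also expose the condition number directly as mod.condition_number.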
Related
I am trying to write a simple covariance matrix function in Python.
import numpy as np
def manual_covariance(x):
mean = x.mean(axis=1)
print(x.shape[1])
cov = np.zeros((len(x), len(x)), dtype='complex64')
for i in range(len(mean)):
for k in range(len(mean)):
s = 0
for j in range(len(x[1])): # 5 Col
s += np.dot((x[i][j] - mean[i]), (x[k][j] - mean[i]))
cov[i, k] = s / ((x.shape[1]) - 1)
return cov
With this function if I compute the covariance of:
A = np.array([[1, 2], [1, 5]])
man_cov = manual_covariance(A)
num_cov = np.cov(A)
My answer matches np.cov(), and there is no problem. But when I use complex numbers instead, my answer does not match np.cov():
A = np.array([[1+1j, 1+2j], [1+4j, 5+5j]])
man_cov = manual_covariance(A)
num_cov = np.cov(A)
Manual result:
[[-0.5+0.j -0.5+2.j]
[-0.5+2.j 7.5+4.j]]
Numpy cov result:
[[0.5+0.j 0.5+2.j]
[0.5-2.j 8.5+0.j]]
I have tried printing every intermediate value to check where it goes wrong, but I am not able to find the fault.
It is because the dot product of two complex vectors z1 and z2 is defined as z1 · z2*, where * means conjugation. If you use s += np.dot((x[i,j] - mean[i]), np.conj(x[k,j] - mean[i])), using NumPy's conjugate function, you should get the correct result.
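For reference, here is a minimal corrected sketch of the whole function with the conjugation applied (the innermost loop is vectorised, and the second factor is centred at its own row mean, which matches np.cov's definition and gives the same result):
import numpy as np

def manual_covariance(x):
    x = np.asarray(x)
    mean = x.mean(axis=1)
    n = x.shape[1]
    cov = np.zeros((len(x), len(x)), dtype=complex)
    for i in range(len(mean)):
        for k in range(len(mean)):
            # E[(x_i - m_i) * conj(x_k - m_k)], the complex covariance
            cov[i, k] = np.sum((x[i] - mean[i]) * np.conj(x[k] - mean[k])) / (n - 1)
    return cov

A = np.array([[1+1j, 1+2j], [1+4j, 5+5j]])
print(np.allclose(manual_covariance(A), np.cov(A)))  # should print True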
I am trying to use the SVD and an eigendecomposition for some data analysis using Dynamic Mode Decomposition. I am running into the simple problem of getting different results from Matlab and Python. I'm confused and don't know why Python is giving me different matrix values when everything looks (and, I think, is) correct.
So instead of using real data this time and looking at large data sets, I generated data. I will look at an eigenvalue plot after the eigendecomposition. I also use a delay embedding, because I will work with a data vector which is only (2x100), so I build a Hankel-type matrix to enrich the data with 10 delays.
clear all; close all; clc;
data = linspace(1,100);
data2 = linspace(2,101);
data = [data;data2];
numDelays = 10;
relTol= 10^-6;
%% Create first and second snap shot matrices for DMD. Any columns with missing
% data are not used.
disp('Constructing Data Matricies:')
X = zeros((numDelays+1)*size(data,1),size(data,2)-(numDelays+1));
Y = zeros(size(X));
for i = 1:numDelays+1
X(1 + (i-1)*size(data,1):i*size(data,1),:) = ...
data(:,(i):size(data,2)-(numDelays+1) + (i-1));
Y(1 + (i-1)*size(data,1):i*size(data,1),:) = ...
data(:,(i+1):size(data,2)-(numDelays+1) + (i));
end
[U,S,V] = svd(X);
r = find(diag(S)>S(1,1)*relTol,1,'last');
disp(['DMD subspace dimension:',num2str(r)])
U = U(:,1:r);
S = S(1:r,1:r);
V = V(:,1:r);
Atil = (U'*Y)*V*(S^-1);
[what,lambda] = eig(Atil);
Phi = (Y*V)*(S^-1)*what;
Keigs = diag(lambda);
tt = linspace(0,2*pi,101);
figure;
plot(real(Keigs),imag(Keigs),'ro')
hold on
plot(cos(tt),sin(tt),'--')
import scipy.io as sc
import math as m
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import sys
from numpy import dot, multiply, diag, power, pi, exp, sin, cos, cosh, tanh, real, imag
from scipy.linalg import expm, sinm, cosm, fractional_matrix_power, svd, eig, inv
def dmd(X, Y, relTol):
U2,Sig2,Vh2 = svd(X, False) # SVD of input matrix
S = np.zeros((Sig2.shape[0], Sig2.shape[0])) # Create S matrix with zeros based on Diag of S
np.fill_diagonal(S, Sig2) # Fill diagonal of S matrix with the nonzero values
r = np.count_nonzero(np.diag(S) > S[0,0] * relTol) # rank truncation
U = U2[:,:r]
Sig = diag(Sig2)[:r,:r] #GOOD =)
V = Vh2.conj().T[:,:r]
Atil = dot(dot(dot(U.conj().T, Y), V), inv(Sig)) # build A tilde
print(Atil)
mu,W = eig(Atil)
Phi = dot(dot(dot(Y, V), inv(Sig)), W) # build DMD modes
return mu, Phi
data = np.array([(np.linspace(1,100,100)),(np.linspace(2,101,100))])
Data = np.array(data)
######### Choose number of Delays ###########
# observable (coordinates of feature points). Setting to zero means only
# experimental observables will be used.
numDelays = 10
relTol = 10**-6
########## Create Data Matrices for DMD ###############
# Create first and second snap shot matrices for DMD. Any columns with missing
# data are not used.
X = np.zeros(((numDelays + 1) * data.shape[0], data.shape[1] - (numDelays + 1)))
Y = np.zeros(X.shape)
for i in range(1, numDelays + 2):
X[0 + (i - 1) * Data.shape[0]:i * Data.shape[0], :] = Data[:, (i):Data.shape[1] - (numDelays + 1) + (i - 0)]
Y[0 + (i - 1) * Data.shape[0]:i * Data.shape[0], :] = Data[:, (i + 0):Data.shape[1] - (numDelays + 1) + (i)]
Keigs, Phi = dmd(X, Y, relTol)
tt = np.linspace(0,2*np.pi,101)
plt.figure()
plt.plot(np.cos(tt),np.sin(tt),'--')
plt.plot(Keigs.real,Keigs.imag,'ro')
plt.title('DMD Eigenvalues')
plt.xlabel(r'Real $\lambda$')
plt.ylabel(r'Imaginary $\lambda$')
# plt.axes().set_aspect('equal')
plt.show()
So in Matlab and Python, I get all of my eigenvalues sitting on the unit circle (as expected), and I get precisely one sitting at 1.
The problem comes when I look at the matrices from the SVD: they appear to have different values. The only matrix that is the same is the 'S or Sig' matrix; the rest differ by a value or a +/- sign. The thing that piqued my interest most is the Atil matrix.
In matlab, it looks like,
[1.0157, -0.3116; 7.91229e-4, 0.9843]
And python it looks like,
[1.0, -4.508e-15; -4.439e-18, 1.0]
Now this may look like nothing more than numerical error, but when I look at real data these differences are larger, and that messes up my analysis.
SVD of a non-square matrix is not unique in U and V. Even if you have a square matrix with non-zero, non-degenerate singular values, singular vectors in U and V are only unique up to a sign factor.
https://math.stackexchange.com/questions/644327/how-unique-on-non-unique-are-u-and-v-in-singular-value-decomposition-svd
Moreover, Matlab (LAPACK + BLAS) and scipy.linalg.svd may use different algorithms for SVD.
This can lead to the differences you have experienced.
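A quick way to see the sign ambiguity in isolation (a sketch, not tied to the DMD code above): flip the sign of one left singular vector together with its matching right singular vector, and the factorization still reproduces X exactly.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))
U, s, Vh = np.linalg.svd(X, full_matrices=False)

U2, Vh2 = U.copy(), Vh.copy()
U2[:, 0] *= -1          # flip the first left singular vector...
Vh2[0, :] *= -1         # ...and the matching right singular vector

print(np.allclose(U @ np.diag(s) @ Vh, U2 @ np.diag(s) @ Vh2))  # True: same X
print(np.allclose(U, U2))                                       # False: different factors
Note also that such sign flips only change Atil by a similarity transform (D*Atil*D with D a diagonal matrix of ±1s), so the DMD eigenvalues are unaffected even though individual entries differ, which is consistent with what you observe.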
I have a large set of time series (> 500) and I'd like to select only the ones that are periodic. I did a bit of literature research and found out that I should look at the autocorrelation. Using numpy I calculate the autocorrelation as:
def autocorr(x):
    norm = x - np.mean(x)
    result = np.correlate(norm, norm, mode='full')
    acorr = result[result.size//2:]
    acorr /= ( x.var() * np.arange(x.size, 0, -1) )
    return acorr
This returns a set of coefficients (r?) that, when plotted, should tell me whether the time series is periodic or not.
I generated two toy examples:
#random signal
s1 = np.random.randint(5, size=80)
#periodic signal
s2 = np.array([5,2,3,1] * 20)
When I generate the autocorrelation plots I obtain:
The second autocorrelation vector clearly indicates some periodicity:
Autocorr1 = [1, 0.28, -0.06, 0.19, -0.22, -0.13, 0.07 ..]
Autocorr2 = [1, -0.50, -0.49, 1, -0.50, -0.49, 1 ..]
My question is: how can I automatically determine, from the autocorrelation vector, whether a time series is periodic? Is there a way to summarise the values into a single coefficient, e.g. 1 = perfect periodicity, 0 = no periodicity at all? I tried to calculate the mean but it is not meaningful. Should I look at the number of 1s?
I would use mode='same' instead of mode='full', because with mode='full' we get covariances for the extreme shifts, where just one array element overlaps itself and the rest are zeros. Those are not going to be interesting. With mode='same' at least half of the shifted array overlaps the original one.
Also, to get the true correlation coefficient (r) you need to divide by the size of the overlap, not by the size of the original x (in my code these are np.arange(n-1, n//2, -1)). Then each of the outputs will be between -1 and 1.
A glance at the Durbin–Watson statistic, which is similar to 2(1-r), suggests that people consider values below 1 to be a significant indication of autocorrelation, which corresponds to r > 0.5. So this is what I use below. For a statistically sound treatment of the significance of autocorrelation, refer to the statistics literature; a starting point would be to have a model for your time series.
def autocorr(x):
n = x.size
norm = (x - np.mean(x))
result = np.correlate(norm, norm, mode='same')
acorr = result[n//2 + 1:] / (x.var() * np.arange(n-1, n//2, -1))
lag = np.abs(acorr).argmax() + 1
r = acorr[lag-1]
if np.abs(r) > 0.5:
print('Appears to be autocorrelated with r = {}, lag = {}'. format(r, lag))
else:
print('Appears to be not autocorrelated')
return r, lag
Output for your two toy examples:
Appears to be not autocorrelated
Appears to be autocorrelated with r = 1.0, lag = 4
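For reference, that output comes from calls along the lines of:
r1, lag1 = autocorr(s1)   # random signal   -> "Appears to be not autocorrelated"
r2, lag2 = autocorr(s2)   # periodic signal -> r = 1.0 at lag = 4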
I have used the finite element method to approximate the Laplace equation, turning it into a matrix system AU = F, where A is the stiffness matrix, and I have solved for U (not massively important for my question).
I have now got my approximation U; when I compute AU I should get back the vector F (or at least something similar).
AU gives the following plot for x = 0 to x = 1 (say, for 20 nodes):
I then need to interpolate U to a longer vector and find AU (for a bigger A too, but I am not interpolating that). I interpolate U by the following:
U_inter = interp1d(x,U)
U_rich = U_inter(longer_x)
which seems to work okay until I multiply it with the longer A matrix:
It seems each spike is at a node of x (i.e. at the nodes of the original U). Does anybody know what could be causing this? The following is my code to find A, U and F.
import numpy as np
import math
import scipy
from scipy.sparse import diags
import scipy.sparse.linalg
from scipy.interpolate import interp1d
import matplotlib
import matplotlib.pyplot as plt
def Poisson_Stiffness(x0):
"""Finds the Poisson equation stiffness matrix with any non uniform mesh x0"""
x0 = np.array(x0)
N = len(x0) - 1 # The amount of elements; x0, x1, ..., xN
h = x0[1:] - x0[:-1]
a = np.zeros(N+1)
a[0] = 1 #BOUNDARY CONDITIONS
a[1:-1] = 1/h[1:] + 1/h[:-1]
a[-1] = 1/h[-1]
a[N] = 1 #BOUNDARY CONDITIONS
b = -1/h
b[0] = 0 #BOUNDARY CONDITIONS
c = -1/h
c[N-1] = 0 #BOUNDARY CONDITIONS: DIRICHLET
data = [a.tolist(), b.tolist(), c.tolist()]
Positions = [0, 1, -1]
Stiffness_Matrix = diags(data, Positions, (N+1,N+1))
return Stiffness_Matrix
def NodalQuadrature(x0):
"""Finds the Nodal Quadrature Approximation of sin(pi x)"""
x0 = np.array(x0)
h = x0[1:] - x0[:-1]
N = len(x0) - 1
approx = np.zeros(len(x0))
approx[0] = 0 #BOUNDARY CONDITIONS
for i in range(1,N):
approx[i] = math.sin(math.pi*x0[i])
approx[i] = (approx[i]*h[i-1] + approx[i]*h[i])/2
approx[N] = 0 #BOUNDARY CONDITIONS
return approx
def Solver(x0):
Stiff_Matrix = Poisson_Stiffness(x0)
NodalApproximation = NodalQuadrature(x0)
NodalApproximation[0] = 0
U = scipy.sparse.linalg.spsolve(Stiff_Matrix, NodalApproximation)
return U
x = np.linspace(0,1,10)
rich_x = np.linspace(0,1,50)
U = Solver(x)
A_rich = Poisson_Stiffness(rich_x)
U_inter = interp1d(x,U)
U_rich = U_inter(rich_x)
AUrich = A_rich.dot(U_rich)
plt.plot(rich_x,AUrich)
plt.show()
comment 1:
I added a Stiffness_Matrix = Stiffness_Matrix.tocsr() statement to avoid an efficiency warning. FE calculations are complex enough that I'll have to print out some intermediate values before I can identify what is going on.
comment 2:
plt.plot(rich_x,A_rich.dot(Solver(rich_x))) plots nicely. The noise you get is the result of the difference between the interpolated U_rich and the true solution: U_rich - Solver(rich_x).
comment 3:
I don't think there's a problem with your code. The problem is with the idea that you can test an interpolation this way. I'm rusty on FE theory, but I think you need to use the shape functions to interpolate, not a simple linear one.
comment 4:
Intuitively, with A_rich.dot(U_rich) you are asking: what kind of forcing F would produce U_rich? Compared to Solver(rich_x), U_rich has flat spots, regions where its value is less than the true solution. What F would produce that? One that is spiky, with NodalQuadrature(x) at the x points and near-zero values in between. That's what your plot is showing.
A higher-order interpolation will eliminate the flat spots and produce a smoother back-calculated F. But you really need to revisit the FE theory.
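As a quick experiment along those lines (a sketch building on the code in the question, not a substitute for interpolating with the shape functions), you can compare a cubic interpolant against the linear one and see how much the spikes shrink:
U_inter_cubic = interp1d(x, U, kind='cubic')      # higher-order interpolation of U
AUrich_cubic = A_rich.dot(U_inter_cubic(rich_x))
plt.plot(rich_x, AUrich, label='linear interpolation')
plt.plot(rich_x, AUrich_cubic, label='cubic interpolation')
plt.legend()
plt.show()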
You might find it instructive to look at
plt.plot(x,NodalQuadrature(x))
plt.plot(rich_x, NodalQuadrature(rich_x))
The second plot is much smoother, but only about 1/5 as high.
Better yet look at:
plt.plot(rich_x,AUrich,'-*') # the spikes
plt.plot(x,NodalQuadrature(x),'o') # original forcing
plt.plot(rich_x, NodalQuadrature(rich_x),'+') # new forcing
In the model the forcing isn't continuous, it is a value at each node. With more nodes (rich_x) the magnitude at each node is less.
I have two gaussian plots:
x = np.linspace(-5,9,10000)
plot1=plt.plot(x,mlab.normpdf(x,2.5,1))
plot2=plt.plot(x,mlab.normpdf(x,5,1))
and I want to find the point where the two curves intersect. Is there a way of doing this? In particular, I want to find the value of the x-coordinate where they meet.
You want to find the x's at which both gaussian functions have the same height (i.e. where they intersect).
You can do so by equating the two gaussian functions and solving for x. In the end you get a quadratic equation whose coefficients are built from the gaussian means and variances. Here is the final result:
import numpy as np
def solve(m1,m2,std1,std2):
a = 1/(2*std1**2) - 1/(2*std2**2)
b = m2/(std2**2) - m1/(std1**2)
c = m1**2 /(2*std1**2) - m2**2 / (2*std2**2) - np.log(std2/std1)
return np.roots([a,b,c])
m1 = 2.5
std1 = 1.0
m2 = 5.0
std2 = 1.0
result = solve(m1,m2,std1,std2)
The output is:
array([ 3.75])
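For reference, the coefficients come from setting the two log densities equal, -(x-m1)**2/(2*std1**2) - log(std1) = -(x-m2)**2/(2*std2**2) - log(std2) (the common 1/sqrt(2*pi) factor cancels), and collecting powers of x; that gives a*x**2 + b*x + c = 0 with exactly the a, b and c computed in solve.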
You can plot the found intersections:
x = np.linspace(-5,9,10000)
plot1=plt.plot(x,mlab.normpdf(x,m1,std1))
plot2=plt.plot(x,mlab.normpdf(x,m2,std2))
plot3=plt.plot(result,mlab.normpdf(result,m1,std1),'o')
The plot will be:
If your gaussians have multiple intersections, the code will also find all of them (say m1=2.5, std1=3.0, m2=5.0, std2=1.0):
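For that example the call would be (note the argument order of solve):
result = solve(2.5, 5.0, 3.0, 1.0)   # m1=2.5, m2=5.0, std1=3.0, std2=1.0
# two real roots this time, at roughly x = 3.48 and x = 7.14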
Here's a solution based purely on numpy that is also applicable to curves other than Gaussians.
import numpy as np
import matplotlib.pyplot as plt
import scipy as sc
import scipy.stats  # so that sc.stats.norm.pdf below resolves

def get_intersection_locations(y1,y2,test=False,x=None):
"""
return indices of the intersection point/s.
"""
idxs=np.argwhere(np.diff(np.sign(y1 - y2))).flatten()
if test:
x=range(len(y1)) if x is None else x
plt.figure(figsize=[2.5,2.5])
ax=plt.subplot()
ax.plot(x,y1,color='r',label='line1',alpha=0.5)
ax.plot(x,y2,color='b',label='line2',alpha=0.5)
_=[ax.axvline(x[i],color='k') for i in idxs]
_=[ax.text(x[i],ax.get_ylim()[1],f"{x[i]:1.1f}",ha='center',va='bottom') for i in idxs]
ax.legend(bbox_to_anchor=[1,1])
ax.set(xlabel='x',ylabel='density')
return idxs
# single intersection
x = np.arange(-10, 10, 0.001)
y1=sc.stats.norm.pdf(x,-2,2)
y2=sc.stats.norm.pdf(x,2,3)
get_intersection_locations(y1=y1,y2=y2,x=x,test=True) # returns the indices, here array([10173])
# double intersection
x = np.arange(-10, 10, 0.001)
y1=sc.stats.norm.pdf(x,-2,1)
y2=sc.stats.norm.pdf(x,2,3)
get_intersection_locations(y1=y1,y2=y2,x=x,test=True)
Based on an answer to a similar question.
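If you need the crossing points more precisely than the grid spacing, one option (a sketch, not part of the answer above) is to linearly interpolate between the two samples that bracket each sign change of y1 - y2:
import numpy as np

def refine_intersections(x, y1, y2):
    """Linearly interpolate the x-position of each sign change of y1 - y2."""
    x = np.asarray(x)
    d = np.asarray(y1) - np.asarray(y2)
    idxs = np.argwhere(np.diff(np.sign(d))).flatten()
    # root of the straight line through (x[i], d[i]) and (x[i+1], d[i+1])
    return x[idxs] - d[idxs] * (x[idxs + 1] - x[idxs]) / (d[idxs + 1] - d[idxs])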