I am trying to solve the following simple optimization problem: find the global minimum variance portfolio, i.e. the portfolio that minimizes variance subject to a single constraint: the weights must sum to one.
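Optimization program: $\min_w \; w^\top \Sigma w$ subject to $\mathbf{1}^\top w = 1$, where $\Sigma$ is the covariance matrix and $\mathbf{1}$ is a vector of ones.
This problem has a well-known closed-form solution: $w = \Sigma^{-1}\mathbf{1} \,/\, (\mathbf{1}^\top \Sigma^{-1} \mathbf{1})$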
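For reference, the closed-form solution can be evaluated with NumPy alone. This is only a minimal sketch: the function name `gmv_weights` and the synthetic covariance matrix are assumptions made for illustration, not the data used further down.

```python
import numpy as np

def gmv_weights(sigma):
    """Closed-form global minimum variance weights: Sigma^{-1} 1 / (1' Sigma^{-1} 1)."""
    ones = np.ones(sigma.shape[0])
    # Solve Sigma z = 1 instead of forming the inverse explicitly
    z = np.linalg.solve(sigma, ones)
    return z / (ones @ z)

# Synthetic covariance matrix, for illustration only (not the data from the question)
rng = np.random.default_rng(0)
returns = rng.normal(size=(500, 5))
sigma_demo = np.cov(returns, rowvar=False)

w_demo = gmv_weights(sigma_demo)
print(w_demo, w_demo.sum())
```

Dividing by `ones @ z` is what enforces the sum-to-one constraint, so any solver output can be compared against these weights.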
I'm trying to reproduce the results using CVXOPT in Python, and I am running into a puzzling issue.
I have a 10x10 sample covariance matrix.
If I solve the problem with the entire 10x10 matrix, the output is incorrect: the "sum-to-one" constraint is not respected, and the weights differ from the closed-form solution.
If I solve the problem with a 7x7 subsample of the same matrix, the output is correct: the "sum-to-one" constraint is respected, and the weights match the closed-form solution.
In fact, it works with any subsample of size 7x7 or smaller, but it no longer works for any subsample of size 8x8 or larger. I can't figure out where the problem is coming from.
Thank you for your help!
Here is a sample code
%reset -sf #Clear Environment
import numpy as np
import cvxopt
from cvxopt import matrix as dmatrix
from cvxopt.solvers import qp, options
from numpy.linalg import inv
from numpy import ones
from numpy import transpose as t
# Sample 10x10 covariance matrix
sigmafull = np.array([[0.01449082, 0.00846992, 0.00846171, 0.00773097, 0.00878925,
0.00748843, 0.00672341, 0.00665912, 0.0068593 , 0.00827341],
[0.00846992, 0.00952205, 0.00766057, 0.00726647, 0.00781524,
0.00672368, 0.00642426, 0.00609368, 0.00617965, 0.00704281],
[0.00846171, 0.00766057, 0.00842194, 0.00700168, 0.00772423,
0.0061137 , 0.00612574, 0.00601041, 0.00621007, 0.00712152],
[0.00773097, 0.00726647, 0.00700168, 0.00687784, 0.00726901,
0.00573606, 0.00567145, 0.00556391, 0.00575279, 0.00660916],
[0.00878925, 0.00781524, 0.00772423, 0.00726901, 0.00860462,
0.00612804, 0.0061301 , 0.00603605, 0.00630947, 0.0075281 ],
[0.00748843, 0.00672368, 0.0061137 , 0.00573606, 0.00612804,
0.00634431, 0.0054793 , 0.00513665, 0.00511852, 0.00575049],
[0.00672341, 0.00642426, 0.00612574, 0.00567145, 0.0061301 ,
0.0054793 , 0.0055722 , 0.0050824 , 0.00512499, 0.00576934],
[0.00665912, 0.00609368, 0.00601041, 0.00556391, 0.00603605,
0.00513665, 0.0050824 , 0.00521583, 0.00510142, 0.00576414],
[0.0068593 , 0.00617965, 0.00621007, 0.00575279, 0.00630947,
0.00511852, 0.00512499, 0.00510142, 0.00547566, 0.00603528],
[0.00827341, 0.00704281, 0.00712152, 0.00660916, 0.0075281 ,
0.00575049, 0.00576934, 0.00576414, 0.00603528, 0.00756009]])
# sigma = sigmafull[0:8,0:8] #With this subsample, output is incorrect. n=8 (and for all n>8)
sigma = sigmafull[0:7,0:7] #With this subsample, output is correct. n=7 (and for all n<7)
n = sigma.shape[0] #Number of assets, used below to size mu and the constraint
sigma = dmatrix(sigma) #Formatting sigma to be a dense matrix for cvxopt
mu = dmatrix(np.zeros(n)) #We just want to minimize variance, hence vector of zeroes
#Format of the equality constraint : Ax = b
#We want the sum of x to be equal to 1
Amatrix = dmatrix(ones(n)).T #Vector of ones
bmatrix = dmatrix(1.0) #Scalar = 1
sol = qp(sigma, mu, None, None, A=Amatrix, b=bmatrix) #No inequality constraint
w_gmv = (inv(sigma) @ ones(n)) / (t(ones(n)) @ inv(sigma) @ ones(n)) #Analytical solution which indeed does sum to 1
print(t(np.array(sol['x'])) - w_gmv) #If Vector of zeroes -> Weights are equivalent -> OK
print(sum(np.array(sol['x']))) #If equal to 1 -> OK
I am conducting PCA on a dataset. I am attempting to add a line in my 3d graph which shows the first principal component. I have tried a few methods but have not been able to display the first principal component as a line in my 3d graph. Any help is greatly appreciated. My code is as follows:
import numpy as np
np.set_printoptions (suppress=True, precision=5, linewidth=150)
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import LabelEncoder
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
file_name = 'C:/Users/data'
input_data = pd.read_csv (file_name + '.csv', header=0, index_col=0)
A = input_data.A.values.astype(float)
B = input_data.B.values.astype(float)
C = input_data.C.values.astype(float)
D = input_data.D.values.astype(float)
E = input_data.E.values.astype(float)
F = input_data.F.values.astype(float)
X = np.column_stack((A, B, C, D, E, F))
ncompo = int (input ("Number of components to study: "))
pca = PCA (n_components = ncompo)
pcafit = pca.fit(X)
cov_mat = np.cov(X, rowvar=0)
eig_vals, eig_vecs = np.linalg.eig(cov_mat)
perc = pcafit.explained_variance_ratio_
perc_x = range(1, len(perc)+1)
plt.plot(perc_x, perc)
plt.ylabel('Percentage of Variance Explained')
#3d Graph
le = LabelEncoder()
number = le.fit_transform(input_data.Grade)  # the encoder must be fitted before it can transform
colormap = np.array(['green', 'blue', 'red', 'yellow'])
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(D, E, F, c=colormap[number])
Some remarks to begin with:
You are computing PCA twice! Computing a PCA means computing the eigenvalues and eigenvectors of the covariance matrix. So either you use the sklearn method pca.fit, or you do it yourself. You don't need to do both, unless you want to discover pca.fit and see for yourself that it does exactly what you expect (if that is what you wanted, fine; that kind of check is a good thing to do, and I did it once too). Of course pca.fit has another advantage: once fitted, the object also provides pca.transform to project points into the component space. But that is also simply a change of basis (after centering) using the eigenvector matrix, which is the change-of-basis matrix.
The pca object gives you the eigenvectors (pca.components_) and the eigenvalues (pca.explained_variance_).
pca.fit is an 'in-place' method. It does not create a new PCA object; it fits the one you have and returns that same object. So there is no need to keep pcafit and use it.
This is not a minimal reproducible example as required on SO. We should be able to copy and paste it, run it, and see exactly your problem, not guess what kind of secret data you have. And at the same time it should be minimal. So it should contain the generation of example data (it doesn't matter if the data don't make sense; sometimes that is even better, since it allows some testing. In my code below, I generate my own noisy data along an axis, which lets me verify that I am indeed able to "guess" what that axis was). Also, since your problem concerns only the 3d plot, there is no need to include the plotting of explained variance here. That part is not part of your question.
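The first three remarks can be checked in a few lines. This is a minimal sketch on synthetic data; the names `X_demo` and `pca_demo` are made up for the illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data with distinct variances per column, for illustration only
X_demo = rng.normal(size=(1000, 4)) * np.array([5.0, 2.0, 1.0, 0.5])

pca_demo = PCA(n_components=4)
returned = pca_demo.fit(X_demo)
print(returned is pca_demo)  # fit() hands back the very same object

# "By hand": eigen-decomposition of the covariance matrix (eigh, since it is symmetric)
eig_vals, eig_vecs = np.linalg.eigh(np.cov(X_demo, rowvar=False))
order = np.argsort(eig_vals)[::-1]  # eigh sorts ascending; PCA sorts descending
eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]

# Eigenvalues match explained_variance_; eigenvectors match components_ up to sign
print(np.allclose(eig_vals, pca_demo.explained_variance_))
print(np.allclose(np.abs(eig_vecs.T), np.abs(pca_demo.components_)))
```

The comparison uses absolute values because an eigenvector is only defined up to its sign.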
Now, to plot the principal component: well, you already did the hard part. Twice. That is, computing it. It is the eigenvector associated with the highest eigenvalue.
With the pca object there is no need to search for it, since the components are already sorted. So it is simply pca.components_[0]. And since you want to plot in the space D,E,F, you simply need to draw the vector pca.components_[0][3:].
With correct scaling.
You can do that with plot, providing just 2 points (first and last).
Here is my version (which, by the way, also shows what a minimal reproducible example is)
import numpy as np
np.set_printoptions (suppress=True, precision=5, linewidth=150)
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import LabelEncoder
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
# Generation of random data along a given vector
vec = np.array([1, -1, 0.5, -0.5, 0.75, 0.75]).reshape(-1, 1)
# 10000 random data, that are U[0,10]×vec + gaussian noise std=1
X = (vec*np.random.rand(10000)*10 + np.random.normal(0, 1, (6, 10000))).T
A, B, C, D, E, F = X.T  # one array per variable
input_data = pd.DataFrame({'A':A,'B':B,'C':C,'D':D,'E':E, 'F':F, 'Grade':np.random.randint(1,5, (10000,))})
ncompo = 6  # keep all components so that both computations can be compared
pca = PCA(n_components=ncompo)
pca.fit(X)
# Redundant
cov_mat = np.cov(X, rowvar=0)
eig_vals, eig_vecs = np.linalg.eig(cov_mat)
# See
print("Eigen values")
print(eig_vals)
print(pca.explained_variance_)
print("Eigen vec")
print(eig_vecs)
print(pca.components_)
# Note, compare first components to
print("Main component")
print(vec/np.linalg.norm(vec))
print(pca.components_[0])
#3d Graph
le = LabelEncoder()
number = le.fit_transform(input_data.Grade)
fig = plt.figure()
colormap = np.array(['green', 'blue', 'red', 'yellow'])
ax = fig.add_subplot(111, projection='3d')
ax.scatter(D, E, F, c=colormap[number])
# Draw the 1st principal component as a blue line
U = pca.components_[0]
# Scaling factors: here, the extent of the data projected on U (one possible choice)
sc1, sc2 = (X @ U).min(), (X @ U).max()
ax.plot([sc1*U[3],sc2*U[3]], [sc1*U[4], sc2*U[4]], [sc1*U[5], sc2*U[5]], linewidth=3)
plt.show()
My example is not that minimal, because I took advantage of it to illustrate my first remark, and also computed the PCA twice, to compare both results.
So, here I print the eigenvalues
Eigen values
[30.88941 1.01334 0.99512 0.96493 0.97692 0.98101]
[30.88941 1.01334 0.99512 0.98101 0.97692 0.96493]
(the 1st being your computation by diagonalisation of the covariance matrix, the 2nd pca.explained_variance_)
As you can see, they are the same, except that the 1st one is not sorted.
Likewise,
Eigen vec
[[-0.52251 -0.27292 0.40863 -0.06321 0.26699 0.6405 ]
[ 0.52521 0.07577 -0.34211 0.27583 -0.04161 0.72357]
[-0.26266 -0.41332 -0.60091 0.38027 0.47573 -0.16779]
[ 0.26354 -0.52548 0.47284 0.59159 -0.24029 -0.15204]
[-0.39493 0.63946 0.07496 0.64966 -0.08619 0.00252]
[-0.3959 -0.25276 -0.35452 -0.0572 -0.79718 0.12217]]
[[ 0.52251 -0.52521 0.26266 -0.26354 0.39493 0.3959 ]
[-0.27292 0.07577 -0.41332 -0.52548 0.63946 -0.25276]
[-0.40863 0.34211 0.60091 -0.47284 -0.07496 0.35452]
[-0.6405 -0.72357 0.16779 0.15204 -0.00252 -0.12217]
[-0.26699 0.04161 -0.47573 0.24029 0.08619 0.79718]
[-0.06321 0.27583 0.38027 0.59159 0.64966 -0.0572 ]]
Also the same, except for sorting and transposition.
Eigenvectors are presented column-wise when you diagonalize a matrix,
whereas in pca.components_ each row is an eigenvector.
But you can see that in the 1st matrix, the eigenvector associated with the biggest eigenvalue, that is, since the biggest eigenvalue was the 1st one, the 1st column (-0.52, 0.52, etc.)
is also the same as the first row of pca.components_ (up to an irrelevant ×-1 factor).
Likewise, the 4th biggest eigenvalue in your diagonalisation was the last one.
And if you look at the last column of your eigenvectors (0.64, 0.72, -0.17...), it is the same as the 4th row of pca.components_ (with an irrelevant ×-1 factor).
So, long story short, you already have the eigenvalues in pca.explained_variance_, sorted from the biggest to the smallest, and the eigenvectors in pca.components_, in the same order.
The last thing I print here is a comparison between the vector I used to generate the data in the first place (normalized; my data are all collinear with a vector vec, plus Gaussian noise) and the first component (pca.components_[0]).
Main component
[[ 0.52523]
 [-0.52523]
 [ 0.26261]
 [-0.26261]
 [ 0.39392]
 [ 0.39392]]
[ 0.52251 -0.52521 0.26266 -0.26354 0.39493 0.3959 ]
As expected, PCA correctly found that main axis.
So, those were just side comments.
What you were really looking for is
ax.plot([sc1*U[3],sc2*U[3]], [sc1*U[4], sc2*U[4]], [sc1*U[5], sc2*U[5]], linewidth=3)
sc1 and sc2 being just scaling factors (here I chose them so that the line scales approximately like the data). Another way would have been to set ax.set_xlim, ax.set_ylim, ax.set_zlim from D.min(), D.max(), E.min(), E.max(), etc.
And then just use big values for sc1 and sc2, like
I am trying to use python (and at present failing) to come to a more efficient solution than Excel Solver provides for an optimization problem.
The problem is of the form AB=C -->D,
where AB produces C such that the absolute value of C-D for each row in the matrix is minimized.
I have seven funds contained in matrix B, all of which have geographic exposure of the form
FUND_NAME = np.array([UK, USA, EuroZone, Japan, EM, Apac])
as below
RLS = np.array([0.788743177, 0.168048481,0,0.043208342,0,0])
From this I need to construct an optimal weighting of the seven funds using a matrix (imaginatively named A) [x1,x2,x3,x4,x5,x6,x7] with x1+x2+...+x7=1, and also, for i=1,...,7:
xi lower bound =0
xi upper bound =0.25
so that the actual regional weights (matrix C) are as close as possible to the Target array below (which corresponds to matrix D above)
I've tried using linprog, but I know that the answer I am getting is wrong.
twentyfive=np.full((1, 7), 0.25)
res = linprog(Target,A_ub=Funds,b_ub=twentyfive,bounds=[bounds])
Can anyone help me move on from Excel?
This is really a LAD regression problem (LAD=Least Absolute Deviation) with some side constraints. Different LP formulations for the LAD regression problems can be found here. Based on the sparse bounding problem, we can formulate the LP model:
This is the mathematical model I am going to try to solve with linprog. The coloring is as follows: blue symbols represent data, red symbols are the decision variables. x are the allocations (fractions) we need to find, d are the residuals of the linear fit and r are the absolute values of d.
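Written out from the matrices built in the Python code of this answer (so the symbols follow that code: B is the m×n exposure matrix, m the number of regions, n the number of funds), the model reads roughly as follows:

```latex
\begin{aligned}
\min_{x,\,d,\,r}\quad & \sum_{i=1}^{m} r_i \\
\text{s.t.}\quad & \sum_{j=1}^{n} B_{ij}\,x_j + d_i = \text{target}_i, && i = 1,\dots,m \\
& \sum_{j=1}^{n} x_j = 1 \\
& -r_i \le d_i \le r_i, && i = 1,\dots,m \\
& 0 \le x_j \le 0.25, && j = 1,\dots,n \\
& r_i \ge 0, && i = 1,\dots,m
\end{aligned}
```

At the optimum each r_i equals |d_i|, so the objective is the sum of absolute deviations from the target.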
linprog requires an explicit LP matrix. For the model above, this A matrix can look like:
With this it is no longer very difficult to develop a Python implementation. The Python code can look like:
import numpy as np
import scipy.optimize as sp
B = np.array([[0.788743177, 0.168048481,0,0.043208342,0,0],
target = np.array([0.2310,0.2576,0.1047,0.1832,0.1103,0.1131])
m,n = np.shape(B)
A_eq = np.block([[B, np.eye(m), np.zeros((m,m))],
[np.ones(n), np.zeros(m), np.zeros(m)]])
A_ub = np.block([[np.zeros((m,n)),-np.eye(m), -np.eye(m)],
[np.zeros((m,n)),np.eye(m), -np.eye(m)]])
b_eq = np.block([target,1])
b_ub = np.zeros(2*m)
c = np.block([np.zeros(n),np.zeros(m),np.ones(m)])
bnd = n*[(0,0.25)] + m*[(None,None)] + m*[(0,None)]
res = sp.linprog(c,A_ub,b_ub,A_eq,b_eq,bnd,options={'disp':True})
allocation = res.x[0:n]
The results look like:
Primal Feasibility Dual Feasibility Duality Gap Step Path Parameter Objective
1.0 1.0 1.0 - 1.0 6.0
0.3777262386888 0.3777262386888 0.3777262386888 0.6478228594143 0.3777262386888 0.3200496644143
0.08438152300367 0.08438152300366 0.08438152300367 0.8087424108466 0.08438152300366 0.1335722585582
0.01563291142478 0.01563291142478 0.01563291142478 0.8341722620104 0.01563291142478 0.1118298108651
0.004083901923022 0.004083901923022 0.004083901923023 0.7432737130498 0.004083901923024 0.1049630948572
0.0006190254179117 0.0006190254179117 0.0006190254179116 0.8815177164943 0.000619025417913 0.1016021916581
3.504935403199e-05 3.504935403066e-05 3.504935403079e-05 0.9676694788778 3.504935402756e-05 0.1012177893279
5.983549975387e-07 5.98354980932e-07 5.983549810074e-07 0.9885372873161 5.983549719474e-07 0.1011921413019
3.056236812029e-11 3.056401712736e-11 3.056394819773e-11 0.9999489201822 3.056087926755e-11 0.1011915586046
Optimization terminated successfully.
Current function value: 0.101192
Iterations: 8
[2.31621461e-01 2.50000000e-01 9.07425872e-12 2.50000000e-01
4.45030949e-10 2.39692743e-01 2.86857955e-02]
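For a version that can be run end to end, here is a minimal sketch of the same formulation wrapped in a function. The function name `lad_allocation` is hypothetical, and the exposure matrix is synthetic (random rows that sum to one), since the full fund data are not reproduced here.

```python
import numpy as np
from scipy.optimize import linprog

def lad_allocation(B, target, upper=0.25):
    """Least-absolute-deviation allocation; variables are ordered [x, d, r]."""
    m, n = B.shape  # m regions, n funds
    # B x + d = target, and sum(x) = 1
    A_eq = np.block([[B, np.eye(m), np.zeros((m, m))],
                     [np.ones((1, n)), np.zeros((1, m)), np.zeros((1, m))]])
    b_eq = np.concatenate([target, [1.0]])
    # -d - r <= 0 and d - r <= 0, i.e. -r <= d <= r
    A_ub = np.block([[np.zeros((m, n)), -np.eye(m), -np.eye(m)],
                     [np.zeros((m, n)), np.eye(m), -np.eye(m)]])
    b_ub = np.zeros(2 * m)
    c = np.concatenate([np.zeros(n), np.zeros(m), np.ones(m)])
    bounds = n * [(0, upper)] + m * [(None, None)] + m * [(0, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:n], res.fun

# Synthetic exposures: 7 funds x 6 regions, each fund's exposures summing to 1
rng = np.random.default_rng(0)
exposures = rng.dirichlet(np.ones(6), size=7)
B_demo = exposures.T  # regions x funds, the layout the model expects
x_true = np.array([0.25, 0.25, 0.2, 0.1, 0.1, 0.05, 0.05])
target_demo = B_demo @ x_true  # a target that is reachable by construction

x_demo, obj_demo = lad_allocation(B_demo, target_demo)
print(x_demo, obj_demo)
```

Because the target is built from a feasible allocation, the total absolute deviation should come out at (numerically) zero; the allocation itself need not equal `x_true`, since several allocations can reach the same target.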
I want factor loadings to see which factor loads to which variables. I am referring to following link:
Factor Loadings using sklearn
Here is my code where input_data is the master_data.
#Scaling the values
X = scale(X)
#taking the number of components equal to the number of variables
#initially we have 9 variables
pca = PCA(n_components=9)
pca.fit(X)
#The amount of variance that each PC explains
var= pca.explained_variance_ratio_
#Cumulative variance explained
var1=np.cumsum(np.round(pca.explained_variance_ratio_, decimals=4)*100)
print(var1)
[ 74.75 85.85 94.1 97.8 98.87 99.4 99.75 100. 100. ]
#Retaining 4 components as they explain 98% of variance
pca = PCA(n_components=4)
pca.fit(X)
print(pca.components_)
array([[ 0.38454129, 0.37344315, 0.2640267 , 0.36079567, 0.38070046,
0.37690887, 0.32949014, 0.34213449, 0.01310333],
[ 0.00308052, 0.00762985, -0.00556496, -0.00185015, 0.00300425,
0.00169865, 0.01380971, 0.0142307 , -0.99974635],
[ 0.0136128 , 0.04651786, 0.76405944, 0.10212738, 0.04236969,
0.05690046, -0.47599931, -0.41419841, -0.01629199],
[-0.09045103, -0.27641087, 0.53709146, -0.55429524, 0.058524 ,
-0.19038107, 0.4397584 , 0.29430344, 0.00576399]])
import math
loadings = pca.components_.T * math.sqrt(pca.explained_variance_)
It gives me the following error: 'only length-1 arrays can be converted to Python scalars'
I understand the problem. I have to traverse the pca.components_ and pca.explained_variance_ arrays, something like:
##just a thought
for i,j in (pca.components_, pca.explained_variance_):
##unable to proceed further
##something wrong here
This is simply a problem of mixing modules. For numpy arrays, use np.sqrt instead of math.sqrt (which only works on single values, not arrays).
Your last line should thus read:
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)
This is a mistake in the original answers you linked to. I have edited them accordingly.
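To illustrate, here is a minimal sketch on synthetic data (the names `X_demo` and `pca_demo` are made up for the example). It also shows a quick sanity check: when all components are kept, the loadings reproduce the sample covariance matrix.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic correlated data, for illustration only
X_demo = rng.normal(size=(200, 3)) @ np.array([[2.0, 0.5, 0.0],
                                               [0.0, 1.0, 0.3],
                                               [0.0, 0.0, 0.5]])

pca_demo = PCA(n_components=3).fit(X_demo)

# np.sqrt works element-wise; broadcasting scales column j of components_.T
# (the j-th eigenvector) by the square root of the j-th eigenvalue
loadings = pca_demo.components_.T * np.sqrt(pca_demo.explained_variance_)

# With all components kept, loadings @ loadings.T rebuilds the sample covariance
print(np.allclose(loadings @ loadings.T, np.cov(X_demo, rowvar=False)))
```

No explicit loop over components is needed: NumPy broadcasting pairs each column with its own eigenvalue.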
My goal is to create a Python 3 program that applies differential operators to a vector V of size N. I did so, tested it for basic operations, and it works (differentiation, gradient...).
On that basis I tried to write more complex equations (Navier-Stokes, Orr-Sommerfeld,...) and to validate my work by calculating the eigenvalues of these equations.
As these eigenvalues were completely unexpected, I simplified my problem and I am currently trying to calculate the eigenvalues of the differentiation matrix alone (see below). But the results seem wrong...
Thanks in advance for your help, because I cannot find any solution to my problem...
Definition of DM:
I use the Chebyshev spectral method to differentiate vectors.
I use the following Chebyshev package (translated from Matlab to Python):
That package allows me to create a differentiation matrix DM, obtained with:
nodes, DM = chebyshev.chebdiff(N, maximal_order)
To obtain the 1st, 2nd, 3rd... order derivative, I write for example:
dVdx1 = np.dot(DM[0,:,:], V)
d2Vdx2 = np.dot(DM[1,:,:], V)
d3Vdx3 = np.dot(DM[2,:,:], V)
I tested that and it works.
I've built different operators based on that differentiation process.
I've tried to validate them by finding their eigenvalues. It didn't go well, so right now I am trying with DM only.
I cannot find the right eigenvalues of DM.
I've tried with different functions:
sympy.solve( (DM-x*np.eye).det(), x) [for small sizes only]
Why I use scipy.sparse.linalg.LinearOperator:
I do not want to use the matrix DM directly, so I wrapped it in a function that performs the differentiation (see code below), like this:
dVdx1 = derivative(V)
The reason why I do that comes from the overall project itself.
This is useful for more complicated equations.
Creating such a function prevents me from using the matrix DM directly to find its eigenvalues (because DM stays inside the function).
For that reason, I use a scipy.sparse.linalg.LinearOperator to wrap my method derivative() and pass it to scipy.sparse.linalg.eigs().
Code and results:
Here is the code to compute these eigenvalues:
import numpy as np
import scipy
import sympy
from scipy.sparse.linalg import aslinearoperator
from scipy.sparse.linalg import eigs
from scipy.sparse.linalg import LinearOperator
import chebyshev
N = 20 # should be 4, 20, 50, 100, 300
max_order = 4
option = 1
#option 1: building the differentiation matrix DM for a given order
if option == 1:
    if 0:
        # usage of package chebyshev, but I add a file with the matrix inside
        nodes, DM = chebyshev.chebdiff(N, max_order)
        order = 1
        DM = DM[order-1,:,:]
        #outfile = TemporaryFile()
        np.save('DM20', DM)
    if 1:
        # loading the matrix from the file
        # uncomment depending on N
        #DM = np.load('DM4.npy')
        DM = np.load('DM20.npy')
        #DM = np.load('DM50.npy')
        #DM = np.load('DM100.npy')
        #DM = np.load('DM300.npy')
#option 2: building a random matrix
elif option == 2:
    j = complex(0, 1)  # built-in complex; np.complex is gone from recent NumPy
    Real = np.random.random((N, N)) - 0.5
    Im = np.random.random((N,N)) - 0.5
    # If I want DM symmetric:
    #Real = np.dot(Real, Real.T)
    #Im = np.dot(Im, Im.T)
    DM = Real + j*Im
    # If I want DM singular:
    #DM[0,:] = DM[1,:]
# Test DM symmetric
print('Is DM symmetric ? \n', (DM.transpose() == DM).all() )
# Test DM Hermitian
print('Is DM hermitian ? \n', (DM.transpose().real == DM.real).all() and
(DM.transpose().imag == -DM.imag).all() )
# building a linear operator which wrap matrix DM
def derivative(v):
    return np.dot(DM, v)
linop_DM = LinearOperator( (N, N), matvec = derivative)
# building a linear operator directly from a matrix DM with asLinearOperator
aslinop_DM = aslinearoperator(DM)
# comparison of LinearOperator and direct Dot Product
V = np.random.random((N))
diff_lo = linop_DM.matvec(V)
diff_mat = np.dot(DM, V)
# diff_lo and diff_mat are equals
#number of eigenvalues to find
k = 1
if 1:
    vals_sparse, vecs = scipy.sparse.linalg.eigs(linop_DM, k, which='SR',
                                                 maxiter = 10000,
                                                 tol = 1E-3)
    vals_sparse = np.sort(vals_sparse)
    print('\nEigenvalues (scipy.sparse.linalg Linear Operator) : \n', vals_sparse)
if 1:
    vals_sparse2, vecs2 = scipy.sparse.linalg.eigs(DM, k, which='SR',
                                                   maxiter = 10000,
                                                   tol = 1E-3)
    vals_sparse2 = np.sort(vals_sparse2)
    print('\nEigenvalues (scipy.sparse.linalg with matrix DM) : \n', vals_sparse2)
if 1:
    vals_sparse3, vecs3 = scipy.sparse.linalg.eigs(aslinop_DM, k, which='SR',
                                                   maxiter = 10000,
                                                   tol = 1E-3)
    vals_sparse3 = np.sort(vals_sparse3)
    print('\nEigenvalues (scipy.sparse.linalg AS linear Operator) : \n', vals_sparse3)
if 0:
    vals_np = np.linalg.eigvals(DM)
    vals_np = np.sort(vals_np)
    print('\nEigenvalues (numpy.linalg) : \n', vals_np)
if 1:
    vals_sp = scipy.linalg.eig(DM)
    vals_sp = np.sort(vals_sp[0])
    print('\nEigenvalues (scipy.linalg.eig) : \n', vals_sp)
if 0:
    x = sympy.Symbol('x')
    D = sympy.Matrix(DM)
    print('\ndet D (sympy):', D.det() )
    # sympy.eye keeps the whole expression symbolic
    E = D - x*sympy.eye(DM.shape[0])
    eig_sympy = sympy.solve(E.det(), x)
    print('\nEigenvalues (sympy) : \n', eig_sympy)
Here are my results (for N=20):
Is DM symmetric ?
Is DM hermitian ?
Eigenvalues (scipy.sparse.linalg Linear Operator) :
Eigenvalues (scipy.sparse.linalg with matrix DM) :
Eigenvalues (scipy.sparse.linalg AS linear Operator) :
Eigenvalues (scipy.linalg.eig) :
[-2.92933791+0.j -2.72062839-1.01741142j -2.72062839+1.01741142j
-2.15314244-1.84770128j -2.15314244+1.84770128j -1.36473659-2.38021351j
-1.36473659+2.38021351j -0.49536645-2.59716913j -0.49536645+2.59716913j
0.38136094-2.53335888j 0.38136094+2.53335888j 0.55256471-1.68108134j
0.55256471+1.68108134j 1.26425751-2.25101241j 1.26425751+2.25101241j
2.03390489-1.74122287j 2.03390489+1.74122287j 2.57770573-0.95982011j
2.57770573+0.95982011j 2.77749810+0.j ]
The values returned by scipy.sparse.linalg.eigs should be among the ones found by scipy/numpy, which is not the case (the same goes for sympy).
I've tried with different random matrices instead of DM (see option 2) (symmetric, non-symmetric, real, imaginary, etc.), both of small size N (4,5,6..) and bigger (100,...).
That worked.
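A minimal sketch of that kind of sanity check, assuming a well-conditioned symmetric test matrix (`M_test` and `apply_M` are illustrative names, not part of my project):

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, eigs

N = 20
rng = np.random.default_rng(0)
# Well-conditioned symmetric test matrix (illustrative, not the Chebyshev DM)
S = rng.normal(size=(N, N)) * 0.01
M_test = np.diag(np.arange(1.0, N + 1)) + (S + S.T) / 2

def apply_M(v):
    # The matrix stays hidden behind a function, as with derivative()
    return M_test @ v

linop = LinearOperator((N, N), matvec=apply_M, dtype=M_test.dtype)

# Largest-magnitude eigenvalue via ARPACK, with a fixed start vector for repeatability
val_sparse = eigs(linop, k=1, which='LM', v0=np.ones(N), return_eigenvectors=False)
val_dense = np.linalg.eigvalsh(M_test).max()
print(val_sparse[0].real, val_dense)
```

For a matrix like this one, the iterative and the dense solvers agree to many digits, which is the behaviour I expected for DM as well.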
By changing parameters like 'which' (LM, SM, LR...), 'tol' (10E-3, 10E-6..), 'maxiter', 'sigma' (0) in scipy.sparse.linalg.eigs, it always worked for random matrices but never for my matrix DM. In the best cases, the eigenvalues found are close to the ones found by scipy.linalg.eig, but they never match.
I really do not know what is so special about my matrix.
I also don't know why using scipy.sparse.linalg.eigs with a matrix, a LinearOperator or an aslinearoperator wrapper gives different results.
For N = 4 :
[[ 3.16666667 -4. 1.33333333 -0.5 ]
[ 1. -0.33333333 -1. 0.33333333]
[-0.33333333 1. 0.33333333 -1. ]
[ 0.5 -1.33333333 4. -3.16666667]]
Any idea is welcome.
Could a moderator tag my question with:
scipy.sparse.linalg.eigs / weideman / eigenvalues / scipy.eig / scipy.sparse.linalg.LinearOperator
I spoke with a few colleagues and partly solved my problem.
My conclusion is that my matrix is simply very ill-conditioned...
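One way to probe how fragile these eigenvalues are is sketched below. It assumes that DM is the standard Chebyshev collocation differentiation matrix, which differentiates polynomial interpolants exactly and is therefore nilpotent, and that the N=4 entries printed above are roundings of simple fractions.

```python
import numpy as np

# N=4 differentiation matrix; entries written as exact fractions on the assumption
# that the printed values (3.16666667, 1.33333333, ...) are roundings of them
D4 = np.array([[19/6, -4.0,  4/3, -1/2],
               [ 1.0, -1/3, -1.0,  1/3],
               [-1/3,  1.0,  1/3, -1.0],
               [ 1/2, -4/3,  4.0, -19/6]])

# Differentiating a cubic interpolant four times gives zero, so D4^4 should vanish
D4_pow4 = np.linalg.matrix_power(D4, 4)
print(np.abs(D4_pow4).max())  # rounding-level residue

# All exact eigenvalues of a nilpotent matrix are 0; the computed ones are
# typically small but not exactly zero, which shows their sensitivity to rounding
eigvals_D4 = np.linalg.eigvals(D4)
print(np.abs(eigvals_D4).max())
```

If this reading is right, the eigenvalues of the unmodified DM are dominated by rounding errors, and the effect grows with N, which would be consistent with the solvers disagreeing.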
In my project, I can simplify my matrix by imposing boundary conditions as follows:
DM[0,:] = 0
DM[:,0] = 0
DM[N-1,:] = 0
DM[:,N-1] = 0
which produces a matrix like this one for N=4:
[[ 0 0 0 0]
[ 0 -0.33333333 -1. 0]
[ 0 1. 0.33333333 0]
[ 0 0 0 0]]
With these conditions, I obtain eigenvalues from scipy.sparse.linalg.eigs that are equal to the ones from scipy.linalg.eig.
I also tried using Matlab, and it returns the same values.
To continue my work, I actually needed to solve the generalized eigenvalue problem in the standard form
λ B x= DM x
It seems that it does not work in my case because of my matrix B (which represents a Laplacian operator).
If you have a similar problem, I advise you to visit that question:
(I think that) the matrix B needs to be positive definite to use scipy.sparse.linalg.eigs.
A solution would be to change B, to use scipy.linalg.eig, or to use Matlab.
I will confirm that later.
I wrote an answer to the Stack Exchange question I mentioned above, which explains how I solved my problem.
It appears that scipy.sparse.linalg.eigs does indeed have a bug if matrix B is not positive definite, and will return wrong eigenvalues.
When choosing the number of principal components (k), we choose k to be the smallest value such that, for example, 99% of the variance is retained.
However, in Python's scikit-learn, I am not 100% sure that pca.explained_variance_ratio_ = 0.99 means "99% of the variance is retained". Could anyone enlighten me? Thanks.
The Python Scikit learn PCA manual is here
Yes, you are nearly right. The pca.explained_variance_ratio_ attribute is a vector holding the fraction of the variance explained by each dimension. Thus pca.explained_variance_ratio_[i] gives the fraction of variance explained solely by the (i+1)-th dimension.
You probably want to do pca.explained_variance_ratio_.cumsum(). That will return a vector x such that x[i] is the cumulative fraction of variance explained by the first i+1 dimensions.
import numpy as np
from sklearn.decomposition import PCA
my_matrix = np.random.randn(20, 5)
my_model = PCA(n_components=5)
my_model.fit_transform(my_matrix)  # the model must be fitted before reading its attributes
print(my_model.explained_variance_)
print(my_model.explained_variance_ratio_)
print(my_model.explained_variance_ratio_.cumsum())
[ 1.50756565 1.29374452 0.97042041 0.61712667 0.31529082]
[ 0.32047581 0.27502207 0.20629036 0.13118776 0.067024 ]
[ 0.32047581 0.59549787 0.80178824 0.932976 1. ]
So in my random toy data, if I picked k=4 I would retain 93.3% of the variance.
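To pick k programmatically rather than by reading the cumulative vector, something along these lines should work. It is a sketch on synthetic data; the 0.99 threshold and the names `X_demo` and `cum_ratio` are just for the illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data with decaying column variances, for illustration only
X_demo = rng.normal(size=(300, 8)) * np.array([10, 6, 4, 2, 1, 0.5, 0.2, 0.1])

cum_ratio = PCA().fit(X_demo).explained_variance_ratio_.cumsum()

threshold = 0.99
# First index whose cumulative ratio exceeds the threshold, +1 to turn it into a count
k = int(np.argmax(cum_ratio > threshold)) + 1
print(k, cum_ratio[k - 1])
```

Here k is the smallest number of components whose cumulative explained variance ratio exceeds the threshold.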
Although this question is more than 2 years old, I want to provide an update on this.
I wanted to do the same and it looks like sklearn now provides this feature out of the box.
As stated in the docs
if 0 < n_components < 1 and svd_solver == ‘full’, select the number of components such that the amount of variance that needs to be explained is greater than the percentage specified by n_components
So the code required is now
my_model = PCA(n_components=0.99, svd_solver='full')
This worked for me with even less typing in the PCA section.
The rest is added for convenience. Only 'data' needs to be defined in an earlier stage.
import sklearn as sl
from sklearn.preprocessing import StandardScaler as ss
from sklearn.decomposition import PCA
st = ss().fit_transform(data)
pca = PCA(0.80)
pc = pca.fit_transform(st) # << to retain the components in an object
print ( "Components = ", pca.n_components_ , ";\nTotal explained variance = ",
round(pca.explained_variance_ratio_.sum(),5) )