Python scikit learn pca.explained_variance_ratio_ cutoff - python

When choosing the number of principal components (k), we choose k to be the smallest value so that for example, 99% of variance, is retained.
However, in the Python Scikit learn, I am not 100% sure pca.explained_variance_ratio_ = 0.99 is equal to "99% of variance is retained"? Could anyone enlighten? Thanks.
The Python Scikit learn PCA manual is here
http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA

Yes, you are nearly right. The pca.explained_variance_ratio_ parameter returns a vector of the variance explained by each dimension. Thus pca.explained_variance_ratio_[i] gives the variance explained solely by the i+1st dimension.
You probably want to do pca.explained_variance_ratio_.cumsum(). That will return a vector x such that x[i] returns the cumulative variance explained by the first i+1 dimensions.
import numpy as np
from sklearn.decomposition import PCA
np.random.seed(0)
my_matrix = np.random.randn(20, 5)
my_model = PCA(n_components=5)
my_model.fit_transform(my_matrix)
print my_model.explained_variance_
print my_model.explained_variance_ratio_
print my_model.explained_variance_ratio_.cumsum()
[ 1.50756565 1.29374452 0.97042041 0.61712667 0.31529082]
[ 0.32047581 0.27502207 0.20629036 0.13118776 0.067024 ]
[ 0.32047581 0.59549787 0.80178824 0.932976 1. ]
So in my random toy data, if I picked k=4 I would retain 93.3% of the variance.

Although this question is older than 2 years i want to provide an update on this.
I wanted to do the same and it looks like sklearn now provides this feature out of the box.
As stated in the docs
if 0 < n_components < 1 and svd_solver == ‘full’, select the number of components such that the amount of variance that needs to be explained is greater than the percentage specified by n_components
So the code required is now
my_model = PCA(n_components=0.99, svd_solver='full')
my_model.fit_transform(my_matrix)

This worked for me with even less typing in the PCA section.
The rest is added for convenience. Only 'data' needs to be defined in an earlier stage.
import sklearn as sl
from sklearn.preprocessing import StandardScaler as ss
from sklearn.decomposition import PCA
st = ss().fit_transform(data)
pca = PCA(0.80)
pc = pca.fit_transform(st) # << to retain the components in an object
pc
#pca.explained_variance_ratio_
print ( "Components = ", pca.n_components_ , ";\nTotal explained variance = ",
round(pca.explained_variance_ratio_.sum(),5) )

Related

Reshaping error in multivariate normal function with Numpy - Python

I have this data (c4), I want to use 4-fold cross validation testing on this matrix.
The way that I'm splitting the data is as follows:
from scipy.stats import multivariate_normal
from sklearn.model_selection import KFold
import math
c4 = np.array([
[5,10,14,18,22,19,21,18,18,19,19,18,15,15,12,4,4,4,3,3,3,3,3,3,3,3,3,3,3,1],
[6,9,11,12,10,10,13,16,18,21,20,19,8,5,4,4,4,4,4,4,4,4,4,4,3,3,3,3,3,3],
[4,8,12,17,18,21,21,21,17,16,15,13,7,8,8,7,7,4,4,4,3,3,3,3,4,4,3,3,3,2],
[3,7,12,17,19,20,22,20,20,19,19,18,17,16,16,15,14,13,12,9,4,4,4,3,3,3,3,3,2,1],
[2,5,8,10,10,11,11,10,13,17,19,20,22,22,20,16,15,15,13,11,8,3,3,3,3,3,3,3,2,1],
[4,8,10,11,10,15,15,17,18,19,18,20,18,17,15,13,12,7,4,4,4,4,4,4,4,4,3,3,3,2],
[2,8,12,15,18,20,19,20,21,21,23,19,19,16,16,16,14,12,10,7,7,7,7,6,3,3,3,3,2,1],
[2,13,17,18,21,22,20,18,18,17,17,15,13,11,8,8,4,4,4,4,4,4,4,4,4,4,4,4,3,1],
[6,6,9,14,15,18,20,20,22,20,16,16,15,11,8,8,8,5,4,4,4,4,4,4,4,5,5,5,5,4],
[8,13,16,20,20,20,19,17,17,17,17,15,14,13,10,6,3,3,3,4,4,4,3,3,4,3,3,3,2,2],
[5,9,17,18,19,18,17,16,14,13,12,12,11,10,4,4,4,3,3,3,3,3,3,3,4,4,3,3,3,3],
[4,6,8,11,16,17,18,20,16,17,16,17,17,16,14,12,12,10,9,9,8,8,6,4,3,3,3,2,2,2] ])
kf = KFold(n_splits=4)
for train_index, test_index in kf.split(c4):
X_train, X_test = c4[train_index], c4[test_index]
X_train_mean = np.mean(X_train)
X_train_cov = np.cov(X_train.T)
v = multivariate_normal(X_train_mean, X_train_cov)
res = v.pdf(X_test)
print (res)
but it didn't work with me, despite that the splitting loop works well with small sample of data.
The error message that I got:
ValueError: cannot reshape array of size 900 into shape (1,1)
Note: the length of all rows is equal.
Thanks in advance.
You are taking the mean of entire matrix X_train when you do np.mean(X_train). What you should do is take mean across the sample axis i.e. if your features are across columns and different samples are across rows, then replace np.mean(X_train) by np.mean(X_train, axis=0). This should solve the error.
Including this line in the above code makes it work. Basically, np.mean(c4[test_index], axis=0) will given you a 1 x 30 mean vector instead of a scalar mean.
from scipy.stats import multivariate_normal as mvn
v = mvn(np.mean(c4[test_index], axis=0), X_train_cov + np.eye(30))
I had to add an identity matrix because I was getting a singular matrix error. However, that has to do with how c4 is defined and nothing to do with this code. Note that to avoid the singularity, you typically add a very small value on the diagonal and not an identity matrix. This is just for illustration.
What is multivariate_normal ? If it is from scipy.stats, then per the doc you must do
multivariate_normal.pdf(X_test, np.mean(X_train, axis=0), X_train_cov)
The doc is here.

Unable to extract factor loadings from sklearn PCA

I want factor loadings to see which factor loads to which variables. I am referring to following link:
Factor Loadings using sklearn
Here is my code where input_data is the master_data.
X=master_data_predictors.values
#Scaling the values
X = scale(X)
#taking equal number of components as equal to number of variables
#intially we have 9 variables
pca = PCA(n_components=9)
pca.fit(X)
#The amount of variance that each PC explains
var= pca.explained_variance_ratio_
#Cumulative Variance explains
var1=np.cumsum(np.round(pca.explained_variance_ratio_, decimals=4)*100)
print var1
[ 74.75 85.85 94.1 97.8 98.87 99.4 99.75 100. 100. ]
#Retaining 4 components as they explain 98% of variance
pca = PCA(n_components=4)
pca.fit(X)
X1=pca.fit_transform(X)
print pca.components_
array([[ 0.38454129, 0.37344315, 0.2640267 , 0.36079567, 0.38070046,
0.37690887, 0.32949014, 0.34213449, 0.01310333],
[ 0.00308052, 0.00762985, -0.00556496, -0.00185015, 0.00300425,
0.00169865, 0.01380971, 0.0142307 , -0.99974635],
[ 0.0136128 , 0.04651786, 0.76405944, 0.10212738, 0.04236969,
0.05690046, -0.47599931, -0.41419841, -0.01629199],
[-0.09045103, -0.27641087, 0.53709146, -0.55429524, 0.058524 ,
-0.19038107, 0.4397584 , 0.29430344, 0.00576399]])
import math
loadings = pca.components_.T * math.sqrt(pca.explained_variance_)
It gives me following error 'only length-1 arrays can be converted to Python scalars
I understand the problem. I have to traverse the pca.components_ and pca.explained_variance_ arrays such as:
##just a thought
Loading=np.empty((8,4))
for i,j in (pca.components_, pca.explained_variance_):
loading=i*math.sqrt(j)
Loading=Loading.append(loading)
##unable to proceed further
##something wrong here
This is simply a problem of mixing modules. For numpy arrays, use np.sqrt instead of math.sqrt (which only works on single values, not arrays).
Your last line should thus read:
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)
This is a mistake in the original answers you linked to. I have edited them accordingly.

Gap Statistic Method

import sys
import numpy as np
import scipy.io as sio
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.svm import SVC
filename = sys.argv[1]
datafile = sio.loadmat(filename)
data = datafile['bow']
sizedata=[len(data), len(data[0])]
gap=[]
SD=[]
for knum in xrange(10,20):
print knum
#Clustering original Data
kmeanspp = KMeans(n_clusters=knum,init = 'k-means++',max_iter = 100,n_jobs = 1)
kmeanspp.fit(data)
dispersion = kmeanspp.inertia_
#Clustering Reference Data
nrefs = 10
refDisp = np.zeros(nrefs)
for nref in xrange(nrefs):
refdata = np.random.random_sample((sizedata[0],sizedata[1]))
refkmeans = KMeans(n_clusters=knum,init='k-means++',max_iter=100,n_jobs=1)
refkmeans.fit(refdata)
refdisp = refkmeans.inertia_
refDisp[nref]=np.log(refdisp)
mean_log_refdisp = np.mean(refDisp)
gap.append(mean_log_refdisp-np.log(dispersion))
#Calculating standard deviaiton
sd = (sum([(r-m)**2 for r,m in zip(refDisp,[mean_log_refdisp]*nrefs)])/nrefs)**0.5
SD.append(sd)
SD = [sd*((1+(1/nrefs))**0.5) for sd in SD]
#determining optimal k
opt_k = None
diff = []
for i in xrange(len(gap)-1):
diff = (SD[i+1]-(gap[i+1]-gap[i]))
if diff>0:
opt_k = i+10
break
print diff
plt.plot(np.linspace(10,19,10,True),gap)
plt.show()
Here I am trying to implement the Gap Statistic method for determining the optimal number of clusters. But the problem is that every time I run the code I get a different value for k.
What is the solution to the problem?
How can the value of optimal k differ for the same data?
I have stored the data in a .mat file beforehand and I am passing it as an argument via terminal
I am looking for the smallest value of k for which Gap(k)>= Gap(k+1)-s(k+1) where s(k+1) = sd(k+1)*square_root(1+(1/B)) where sd is the standard deviation of the reference distribution and B is the number of copies of Monte Carlo sample
Otherwise stated, I am searching for the value of k for which
s(k+1)-Gap(k+1)+Gap(k)>=0
Couple of problems with your simulation:
1- sd = (sum([(r-m)**2 for r,m in zip(refDisp,[mean_log_refdisp]*nrefs)])/nrefs)**0.5
Why did you multiply the second component of zip by nrefs that is not needed according to the original paper.
2-
if diff>0:
opt_k = i+10
break
if diff>0 you want diff>=0 since equality can happen a
About why you get different number of clusters each time, as people said it is monte carlo simulation so there can be randomness and also it depends on what you are clustering and your dataset. I suggest you to test your algorithms against Silhouette and Elbow to get a better idea about number of clusters.
One option is to run your function several times and then average the gap statistics and the s values, and find the smallest k where the average s(k+1)-Gap(k+1)+Gap(k) is greater than
This will take longer but give a more reliable result.

How to get predictive attributes of each target in `Random Forest`?

I've been messing around with Random Forest models lately and they are really useful w/ the feature_importance_ attribute!
It would be useful to know which variables are more predictive of particular targets.
For example, what if the 1st and 2nd attributes were more predictive of distringuishing target 0 but the 3rd and 4th attributes were more predictive of target 1?
Is there a way to get the feature_importance_ array for each target separately? With sklearn, scipy, pandas, or numpy preferably.
# Iris dataset
DF_iris = pd.DataFrame(load_iris().data,
index = ["iris_%d" % i for i in range(load_iris().data.shape[0])],
columns = load_iris().feature_names)
Se_iris = pd.Series(load_iris().target,
index = ["iris_%d" % i for i in range(load_iris().data.shape[0])],
name = "Species")
# Import modules
from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import train_test_split
# Split Data
X_tr, X_te, y_tr, y_te = train_test_split(DF_iris, Se_iris, test_size=0.3, random_state=0)
# Create model
Mod_rf = RandomForestClassifier(random_state=0)
Mod_rf.fit(X_tr,y_tr)
# Variable Importance
Mod_rf.feature_importances_
# array([ 0.14334485, 0.0264803 , 0.40058315, 0.42959169])
# Target groups
Se_iris.unique()
# array([0, 1, 2])
This is not really how RF works. Since there is no simple "feature voting" (which takes place in linear models) it is really hard to answer the question what "feature X is more predictive for target Y" even means. What feature_importance of RF captures is "how probable is, in general, to use this feature in the decision process". The problem with addressing your question is that if you ask "how probable is, in general, to use this feature in decision process leading to label Y" you would have to pretty much run the same procedure but remove all subtrees which do not contain label Y in a leaf - this way you remove parts of the decision process which do not address the problem "is it Y or not Y" but rather try to answer which "not Y" it is. However, in practice, due to very stochastic nature of RF, cutting its depth etc. this might barely reduce anything. The bad news is also, that I never seen it implemented in any standard RF library, you could do this on your own, just the way I said:
for i = 1 to K (K is number of distinct labels)
tmp_RF = deepcopy(RF)
for tree in tmp_RF:
tree = remove_all_subtrees_that_do_not_contain_given_label(tree, i)
for x in X (X is your dataset)
features_importance[i] += how_many_times_each_feature_is_used(tree, x) / |X|
features_importance[i] /= |tmp_RF|
return features_importance
in particular you could use existing feature_importance codes, simply by doing
for i = 1 to K (K is number of distinct labels)
tmp_RF = deepcopy(RF)
for tree in tmp_RF:
tree = remove_all_subtrees_that_do_not_contain_given_label(tree, i)
features_importance[i] = run_regular_feature_importance(tmp_RF)
return features_importance

Why are LASSO in sklearn (python) and matlab statistical package different?

I am using LaasoCV from sklearn to select the best model is selected by cross-validation. I found that the cross validation gives different result if I use sklearn or matlab statistical toolbox.
I used matlab and replicate the example given in
http://www.mathworks.se/help/stats/lasso-and-elastic-net.html
to get a figure like this
Then I saved the matlab data, and tried to replicate the figure with laaso_path from sklearn, I got
Although there are some similarity between these two figures, there are also certain differences. As far as I understand parameter lambda in matlab and alpha in sklearn are same, however in this figure it seems that there are some differences. Can somebody point out which is the correct one or am I missing something? Further the coefficient obtained are also different (which is my main concern).
Matlab Code:
rng(3,'twister') % for reproducibility
X = zeros(200,5);
for ii = 1:5
X(:,ii) = exprnd(ii,200,1);
end
r = [0;2;0;-3;0];
Y = X*r + randn(200,1)*.1;
save randomData.mat % To be used in python code
[b fitinfo] = lasso(X,Y,'cv',10);
lassoPlot(b,fitinfo,'plottype','lambda','xscale','log');
disp('Lambda with min MSE')
fitinfo.LambdaMinMSE
disp('Lambda with 1SE')
fitinfo.Lambda1SE
disp('Quality of Fit')
lambdaindex = fitinfo.Index1SE;
fitinfo.MSE(lambdaindex)
disp('Number of non zero predictos')
fitinfo.DF(lambdaindex)
disp('Coefficient of fit at that lambda')
b(:,lambdaindex)
Python Code:
import scipy.io
import numpy as np
import pylab as pl
from sklearn.linear_model import lasso_path, LassoCV
data=scipy.io.loadmat('randomData.mat')
X=data['X']
Y=data['Y'].flatten()
model = LassoCV(cv=10,max_iter=1000).fit(X, Y)
print 'alpha', model.alpha_
print 'coef', model.coef_
eps = 1e-2 # the smaller it is the longer is the path
models = lasso_path(X, Y, eps=eps)
alphas_lasso = np.array([model.alpha for model in models])
coefs_lasso = np.array([model.coef_ for model in models])
pl.figure(1)
ax = pl.gca()
ax.set_color_cycle(2 * ['b', 'r', 'g', 'c', 'k'])
l1 = pl.semilogx(alphas_lasso,coefs_lasso)
pl.gca().invert_xaxis()
pl.xlabel('alpha')
pl.show()
I do not have matlab but be careful that the value obtained with the cross--validation can be unstable. This is because it influenced by the way you subdivide the samples.
Even if you run 2 times the cross-validation in python you can obtain 2 different results.
consider this example :
kf=sklearn.cross_validation.KFold(len(y),n_folds=10,shuffle=True)
cv=sklearn.linear_model.LassoCV(cv=kf,normalize=True).fit(x,y)
print cv.alpha_
kf=sklearn.cross_validation.KFold(len(y),n_folds=10,shuffle=True)
cv=sklearn.linear_model.LassoCV(cv=kf,normalize=True).fit(x,y)
print cv.alpha_
0.00645093258722
0.00691712356467
it's possible that alpha = lambda / n_samples
where n_samples = X.shape[0] in scikit-learn
another remark is that your path is not very piecewise linear as it could/should be. Consider reducing the tol and increasing max_iter.
hope this helps
I know this is an old thread, but:
I'm actually working on piping over to LassoCV from glmnet (in R), and I found that LassoCV doesn't do too well with normalizing the X matrix first (even if you specify the parameter normalize = True).
Try normalizing the X matrix first when using LassoCV.
If it is a pandas object,
(X - X.mean())/X.std()
It seems you also need to multiple alpha by 2
Though I am unable to figure out what is causing the problem, there is a logical direction in which to continue.
These are the facts:
Mathworks have selected an example and decided to include it in their documentation
Your matlab code produces exactly the result as the example.
The alternative does not match the result, and has provided inaccurate results in the past
This is my assumption:
The chance that mathworks have chosen to put an incorrect example in their documentation is neglectable compared to the chance that a reproduction of this example in an alternate way does not give the correct result.
The logical conclusion: Your matlab implementation of this example is reliable and the other is not.
This might be a problem in the code, or maybe in how you use it, but either way the only logical conclusion would be that you should continue with Matlab to select your model.

Categories