Imagine I have training data with 9 dimensions and 6000 samples, and I applied the PCA algorithm using sklearn's PCA.
I reduced its dimensions to 4, and now I want to convert one new sample with 9 features into my training data space with 4 components as fast as possible.
Here is my first PCA code:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
X_std = StandardScaler().fit_transform(df1)
pca = PCA(n_components = 4)
result = pca.fit_transform(X_std)
Is there any way to do this with sklearn's PCA?
If you want to project data into the reduced-dimensionality space found by PCA, you can use the transform method, which runs an efficient inner product between the input matrix and the eigenvectors:
pca = PCA(n_components=4)
pca.fit(X_std)  # fit on the standardized training data
X_std_reduced = pca.transform(X_std)
From the scikit source:
X_transformed = fast_dot(X, self.components_.T)
So applying the PCA transformation is simply a linear combination, which is very fast. Now you can apply the projection to the training set and to any new data that you want to test against in the future. New samples must first go through the same fitted StandardScaler as the training data.
This article describes the process in more detail: http://www.eggie5.com/69-dimensionality-reduction-using-pca
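A minimal sketch of the full round trip for one new sample, assuming synthetic random data in place of df1 (the names X_train, new_sample and manual are made up for illustration). The manual projection at the end is only there to show what transform computes:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(size=(6000, 9))  # assumed stand-in for df1: 6000 samples, 9 features

# Fit the scaler and the PCA once, on the training data only.
scaler = StandardScaler().fit(X_train)
pca = PCA(n_components=4).fit(scaler.transform(X_train))

# One new sample with 9 features; sklearn expects a 2-D array of shape (1, 9).
new_sample = rng.normal(size=(1, 9))
new_reduced = pca.transform(scaler.transform(new_sample))  # shape (1, 4)

# The same projection by hand: subtract the PCA mean, then multiply by the components.
manual = (scaler.transform(new_sample) - pca.mean_) @ pca.components_.T
print(new_reduced.shape, np.allclose(new_reduced, manual))
```

Keeping the fitted scaler and pca objects around means each new sample costs only one subtraction and one small matrix product.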
I'm using StandardScaler() and the lin_reg.coef_ attribute in the following context:
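from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
for i in range(100):
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=i)
    # fit the scaler on the training split only, then apply it to both splits
    scaler = StandardScaler().fit(x_train)
    x_train = scaler.transform(x_train)
    x_test = scaler.transform(x_test)
    lin_reg = LinearRegression().fit(x_train, y_train)
    if i == 0:
        print(lin_reg.coef_)
    if i == 1:
        print(lin_reg.coef_)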
This leads to the following output:
[Image: code output]
So, as expected, coef_ returns the coefficients for the 22 different features I am passing into the linear regression. However, in the second output, some of the coefficients are far too large (e.g. 1.61e+14). I am pretty sure that the scaling with StandardScaler() works as it should. However, if I do not scale the training data before fitting the regression, I do not get these large coefficients. One important thing I should mention is that the last 13 features are binary, whereas the first 9 are continuous (such as age). I can imagine that the problem is somehow related to this fact, although the coefficient for the first binary feature is computed properly (only the last 12 binary features have overly large coefficients).
You should use standardization when the data come from a roughly Gaussian distribution. Using StandardScaler() on binary data doesn't make any sense.
You should scale only the first nine variables, and then pass all 22 features to the linear regression.
https://www.atoti.io/when-to-perform-a-feature-scaling/
Avoid scaling binary columns in sci-kit learn StandsardScaler
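One way to scale only the continuous columns is sklearn's ColumnTransformer with remainder="passthrough". This is a sketch under assumptions: the data below is synthetic (9 continuous columns followed by 13 binary ones, mimicking the layout described in the question), and the names continuous_cols and preprocess are made up for illustration.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Assumed layout: 9 continuous features first, then 13 binary features.
X_cont = rng.normal(loc=40.0, scale=10.0, size=(200, 9))
X_bin = rng.integers(0, 2, size=(200, 13)).astype(float)
x = np.hstack([X_cont, X_bin])
y = x @ rng.normal(size=22) + rng.normal(size=200)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)

continuous_cols = list(range(9))
preprocess = ColumnTransformer(
    [("scale", StandardScaler(), continuous_cols)],
    remainder="passthrough",  # binary columns are passed through unchanged
)

# Fit on the training split only, then reuse the fitted transformer on the test split.
x_train_scaled = preprocess.fit_transform(x_train)
x_test_scaled = preprocess.transform(x_test)

lin_reg = LinearRegression().fit(x_train_scaled, y_train)
print(lin_reg.coef_.shape)
```

Because the scaled columns come first in the input, the output keeps the original column order: scaled continuous features, then the untouched binary ones.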
I was reading the post "Recovering features names of explained_variance_ratio_ in PCA with sklearn" and I wanted to understand the output of the following line of code:
pd.DataFrame(pca.components_, columns=subset.columns)
First, I thought that the PCA components from sklearn would be how much of the variance is explained by each feature (I guess this is the interpretation of PCA, right?). However, I think that this is actually wrong, and the explained variance is given by pca.explained_variance_.
Also, the output of the dataframe constructed with the script above is very confusing to me, because it has several rows and there are also negative numbers.
Furthermore, how does the dataframe constructed above relate to the following plot:
plt.bar(range(len(pca.explained_variance_)), pca.explained_variance_)
I'm really confused about the PCA components and the variance.
If an example is needed, we can build a PCA on the iris dataset. This is what I've done so far:
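import matplotlib.pyplot as plt
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
# iris is assumed to be a pandas DataFrame loaded elsewhere
subset = iris.iloc[:, 1:5]
scaler = StandardScaler()
pca = PCA()
pipe = make_pipeline(scaler, pca)
pipe.fit(subset)
# Plot the explained variances
features = range(pca.n_components_)
_ = plt.bar(features, pca.explained_variance_)
# Dump components relations with features:
pd.DataFrame(pca.components_, columns=subset.columns)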
In PCA, the components (in sklearn, components_) are linear combinations of the original features, chosen to maximize variance. So they are vectors of weights, one weight per input feature, which is why the DataFrame has one row per component and why entries can be negative: a negative entry is just a negative weight in that combination.
In sklearn, as the PCA documentation notes, components_ are ordered by their explained variance (explained_variance_), from the highest to the lowest value. So the i-th row of components_ corresponds to the i-th value of explained_variance_.
A useful link on PCA: https://online.stat.psu.edu/stat505/lesson/11
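To make the link between components_ and explained_variance_ concrete, here is a small sketch. It assumes the copy of iris bundled with sklearn (load_iris) rather than the asker's DataFrame, and the names loadings, row_norms and score_variances are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

subset = load_iris(as_frame=True).data  # four numeric feature columns

pipe = make_pipeline(StandardScaler(), PCA()).fit(subset)
pca = pipe.named_steps["pca"]

# One row per principal component, one column per original feature.
loadings = pd.DataFrame(pca.components_, columns=subset.columns)

# Each row is a unit-length direction in feature space.
row_norms = np.linalg.norm(pca.components_, axis=1)

# Projecting the scaled data onto row i gives a new variable whose
# sample variance is explained_variance_[i].
scores = pipe.transform(subset)
score_variances = scores.var(axis=0, ddof=1)

print(loadings)
print(np.allclose(score_variances, pca.explained_variance_))
```

So the bar plot shows one bar per row of the DataFrame: the height is the variance of the data along the direction that row describes.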
I have applied CountVectorizer() on my X_train and it returned a sparse matrix.
Usually, if we want to standardize a sparse matrix, we pass the with_mean=False parameter.
scaler = StandardScaler(with_mean=False)
X_train = scaler.fit_transform(X_train)
But in my case, after applying CountVectorizer on my X_train, I have also performed PCA (TruncatedSVD) to reduce dimensions. Now my data is no longer a sparse matrix.
So can I now apply StandardScaler() directly without passing with_mean=False (i.e. with with_mean=True)?
If you look at what the with_mean parameter does, you'll find that it simply centers your data before scaling. The reason you don't center a sparse matrix is that centering turns it into a dense matrix, which occupies much more memory and destroys the sparsity you had in the first place.
After you perform PCA your data has reduced dimensions and can now be centered before scaling. So yes, you can apply StandardScaler() directly.
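A minimal sketch of that sequence, assuming a tiny made-up corpus (docs) and n_components=2 purely for illustration:

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy corpus standing in for X_train.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
    "the log is on the mat",
]

counts = CountVectorizer().fit_transform(docs)  # scipy sparse matrix
reduced = TruncatedSVD(n_components=2, random_state=0).fit_transform(counts)  # dense ndarray

# The reduced data is dense, so the default with_mean=True can be used.
scaled = StandardScaler().fit_transform(reduced)

print(sparse.issparse(counts), sparse.issparse(reduced))
print(np.allclose(scaled.mean(axis=0), 0.0))
```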
I am trying to do a dimension reduction using PCA from scikit-learn. My data set has around 300 samples and 4096 features. I want to reduce the dimensions to 400 and 40. But when I call the algorithm, the resulting data has at most "number of samples" features.
from sklearn.decomposition import PCA
pca = PCA(n_components = 400)
trainData = pca.fit_transform(trainData)
testData = pca.transform(testData)
The initial shape of trainData is 300x4096 and the resulting shape is 300x300. Is there any way to perform this operation on this kind of data (many features, few samples)?
The maximum number of principal components that can be extracted from an M x N dataset is min(M, N). It's not an algorithm issue; fundamentally, that is the maximum number there are.
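A quick way to see the limit, sketched on made-up random data with fewer samples than features (30 x 100 here instead of 300 x 4096, to keep it fast):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
train = rng.normal(size=(30, 100))  # few samples, many features

# With n_components left unset, PCA keeps as many components as it can.
pca = PCA().fit(train)
max_components = pca.n_components_  # min(30, 100)

# Asking for fewer components than that limit works as usual.
reduced = PCA(n_components=20).fit_transform(train)

print(max_components, pca.components_.shape, reduced.shape)
```

For the 300x4096 case this means 40 components is a valid request, while 400 is not.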
I've a set of 4k text documents.
They belong to 10 different classes.
I'm trying to see how random forest method performs classification.
The issue is that my feature extraction class extracts 200k features (a combination of words, bigrams, collocations, etc.).
This is highly sparse data, and the random forest implementation in sklearn does not work with sparse data inputs.
Q. What are my options here? Reduce the number of features? How?
Q. Is there any implementation of random forest out there which works with sparse arrays?
My relevant code is as follows:
import logging
import numpy as np
from optparse import OptionParser
import sys
from time import time
#import pylab as pl
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from special_analyzer import *
data_train = load_files(RAW_DATA_SRC_TR)
data_test = load_files(RAW_DATA_SRC_TS)
# split a training set and a test set
y_train, y_test = data_train.target, data_test.target
vectorizer = CountVectorizer( analyzer=SpecialAnalyzer()) # SpecialAnalyzer is my class extracting features from text
X_train = vectorizer.fit_transform(data_train.data)
rf = RandomForestClassifier(max_depth=10,max_features=10)
rf.fit(X_train,y_train)
Several options: take only the 10000 most popular features by passing max_features=10000 to CountVectorizer, and convert the results to a dense numpy array with the toarray method:
X_train_array = X_train.toarray()
Otherwise reduce the dimensionality to 100 or 300 dimensions with:
pca = TruncatedSVD(n_components=300)
X_reduced_train = pca.fit_transform(X_train)
However, in my experience, I could never make an RF work better than a well-tuned linear model (such as logistic regression with a grid-searched regularization parameter) on the original sparse data (possibly with TF-IDF normalization).
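Both options can be sketched end to end as follows. This assumes a tiny made-up corpus and deliberately small settings (max_features=10, n_components=3, 50 trees) so that it runs instantly; the real values would be closer to the ones suggested above.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: four finance documents (0) and four sports documents (1).
docs = [
    "the stock market fell sharply today",
    "investors sold shares as the market dropped",
    "the central bank raised interest rates",
    "shares and bonds rallied after the bank report",
    "the team won the football match",
    "the striker scored two goals in the match",
    "the coach praised the team after the game",
    "fans cheered as the team scored",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Option A: cap the vocabulary size, then convert the sparse counts to a dense array.
vectorizer = CountVectorizer(max_features=10)
X_dense = vectorizer.fit_transform(docs).toarray()
rf_dense = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_dense, labels)

# Option B: keep the full vocabulary and reduce it with TruncatedSVD,
# which accepts sparse input and returns a dense array.
svd_rf = make_pipeline(
    CountVectorizer(),
    TruncatedSVD(n_components=3, random_state=0),
    RandomForestClassifier(n_estimators=50, random_state=0),
).fit(docs, labels)

print(X_dense.shape, svd_rf.predict(docs).shape)
```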
Option 1:
"If the number of variables is very large, forests can be run once with all the variables, then run again using only the most important variables from the first run."
from: http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#giniimp
I'm not sure whether the random forest in sklearn has a feature importance option. The random forest in R implements mean decrease in Gini impurity as well as mean decrease in accuracy.
Option 2:
Do dimensionality reduction. Use PCA or another dimension reduction technique to turn the sparse matrix of N dimensions into a smaller, denser matrix, and then use this smaller matrix for the classification problem.
Option 3:
Drop correlated features. I believe the random forest is supposed to be more robust to correlated features than multinomial logistic regression. That being said, it could be the case that you have a number of correlated features. If you have a lot of pairwise correlated variables, you can drop one of the two variables and you should in theory not lose "predictive power". In addition to pairwise correlation, there are also multiple correlations. Check out: http://en.wikipedia.org/wiki/Variance_inflation_factor
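On Option 1: sklearn's RandomForestClassifier does expose impurity-based importances through its feature_importances_ attribute, so the two-pass idea can be tried there too. A minimal sketch, assuming synthetic data from make_classification and an arbitrary cut-off of 10 features (top_k is made up for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Assumed synthetic data: 50 features, only 5 of them informative.
X, y = make_classification(
    n_samples=300, n_features=50, n_informative=5, random_state=0
)

# First pass: fit on all features and read the impurity-based importances.
first_pass = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = first_pass.feature_importances_

# Keep the indices of the top_k most important features (hypothetical cut-off).
top_k = 10
top_idx = np.argsort(importances)[::-1][:top_k]

# Second pass: refit using only the selected columns.
X_top = X[:, top_idx]
second_pass = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_top, y)

print(importances.shape, X_top.shape)
```

With a sparse text matrix, the first pass would still need an input format the forest accepts, so this may work best after capping the vocabulary.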