I am trying to do dimensionality reduction using PCA from scikit-learn. My data set has around 300 samples and 4096 features. I want to reduce the dimensions to 400 and to 40. But when I call the algorithm, the resulting data has at most "number of samples" features.
from sklearn.decomposition import PCA
pca = PCA(n_components = 400)
trainData = pca.fit_transform(trainData)
testData = pca.transform(testData)
Where initial shape of trainData is 300x4096 and the resulting data shape is 300x300. Is there any way to perform this operation on this kind of data (lot of features, few samples)?
The maximum number of principal components that can be extracted from an M x N dataset is min(M, N). It's not an algorithm issue; fundamentally, that is the maximum number of components there are.
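For illustration, a minimal sketch with random data of the same shape as in the question (the data itself is made up), showing that the number of extractable components is capped at min(n_samples, n_features):
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data with the question's shape: 300 samples, 4096 features
X = np.random.rand(300, 4096)

# n_components cannot exceed min(n_samples, n_features) = 300,
# so request at most that many components
pca = PCA(n_components=min(X.shape))
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)  # (300, 300)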
I've really been stuck on this for over a week now, working with a YouTube likes prediction dataset. I had to drop all non-textual features and all features not correlated with the target, leaving 3 features, and the dataset is just (26061, 12).
But using linear regression I saw that my MSE was huge, and so was the MAE (about 15,000). I also used gradient boosting with the same result, and discovered it doesn't work for this dataset with any value greater than 5 for n_estimators. I also tried to transform X_train and X_test using a power transformer to get a roughly Gaussian distribution, but that still didn't work.
I can't figure out what is really wrong.
here's link to my colab notebook https://colab.research.google.com/drive/1dJZuG0n63842DEwHMR7TzLBmssnOKsj4?usp=sharing
link to dataset https://www.kaggle.com/jinxzed/youtube-likes-prediction-av-hacklive
The scales of your features are very different ('views' is numerically much larger than the other variables). This makes the 'views' feature have a much higher influence on the final output than the other variables.
I'd recommend normalizing the features before feeding the data into any model. You can use sklearn's StandardScaler:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
Also, a large MSE doesn't necessarily mean your model is bad, since the size of the MSE also depends on the size of the label y. For example, for the same 10% difference between true_label and predict_label,
true_label = 1000, predict_label = 1100 -> Squared Error = 10000
would result in a much larger Squared Error than
true_label = 1, predict_label = 1.1 -> Squared Error = 0.01
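To judge the model independently of the scale of the labels, you can look at a relative error or at R^2 instead of the raw MSE/MAE. A small sketch with made-up numbers:
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Hypothetical true and predicted like counts (large values, so the MAE looks large too)
y_true = np.array([10000, 50000, 200000])
y_pred = np.array([11000, 45000, 210000])

mae = mean_absolute_error(y_true, y_pred)
relative_mae = mae / np.mean(y_true)  # error relative to the typical label size
r2 = r2_score(y_true, y_pred)         # scale-free goodness of fit
print(mae, relative_mae, r2)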
I was reading this post Recovering features names of explained_variance_ratio_ in PCA with sklearn and I wanted to understand the output of the following line of code:
pd.DataFrame(pca.components_, columns=subset.columns)
First, I thought that the PCA components from sklearn would be how much of the variance is explained by each feature (I guess this is the interpretation of PCA, right?). However, I think that this is actually wrong, and that the explained variance is given by pca.explained_variance_.
Also, the output of the dataframe constructed with the script above is very confusing to me, because it has several rows and also contains negative numbers.
Furthermore, how does the dataframe constructed above relate to the following plot:
plt.bar(range(pca.n_components_), pca.explained_variance_)
I'm really confused about the PCA components and the variance.
If an example is needed, we can build a PCA on the iris dataset. This is what I've done so far:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline

# `iris` is assumed to be the iris dataset already loaded as a pandas DataFrame
subset = iris.iloc[:, 1:5]
scaler = StandardScaler()
pca = PCA()
pipe = make_pipeline(scaler, pca)
pipe.fit(subset)
# Plot the explained variances
features = range(pca.n_components_)
_ = plt.bar(features, pca.explained_variance_)
# Dump components relations with features:
pd.DataFrame(pca.components_, columns=subset.columns)
In PCA, the components (in sklearn, components_) are linear combinations of the original features, constructed to maximize the captured variance. In other words, they are the vectors that combine the input features so as to maximize the variance.
In sklearn, as referenced here, the components_ are presented in order of their explained variance (explained_variance_), from the highest to the lowest value. So, the i-th vector of components_ has the i-th value of explained_variance_.
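As a small sketch built on the pipeline above (assuming pca and subset are the fitted PCA and the iris feature DataFrame from the question), you can put the loadings and the variance each component explains side by side:
import pandas as pd

# One row per principal component, one column per original feature
loadings = pd.DataFrame(
    pca.components_,
    columns=subset.columns,
    index=["PC%d" % (i + 1) for i in range(pca.n_components_)],
)
# Attach the variance (and its ratio) explained by each component
loadings["explained_variance"] = pca.explained_variance_
loadings["explained_variance_ratio"] = pca.explained_variance_ratio_
print(loadings)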
A useful link on PCA: https://online.stat.psu.edu/stat505/lesson/11
from tensorflow.examples.tutorials.mnist import input_data
mnist=input_data.read_data_sets('data/MNIST/', one_hot=True)
numpy implementation
import numpy as np
import scipy.linalg
import matplotlib.pyplot as plt

# Entire Data set
Data = np.array(mnist.train.images)
# centering the data
mu_D = np.mean(Data, axis=0)
Data -= mu_D

COV_MA = np.cov(Data, rowvar=False)
eigenvalues, eigenvec = scipy.linalg.eigh(COV_MA, eigvals_only=False)
together = zip(eigenvalues, eigenvec)
together = sorted(together, key=lambda t: t[0], reverse=True)
eigenvalues[:], eigenvec[:] = zip(*together)
n=3
pca_components=eigenvec[:,:n]
print(pca_components.shape)
data_reduced = Data.dot(pca_components)
print(data_reduced.shape)
data_original = np.dot(data_reduced, pca_components.T) # inverse_transform
print(data_original.shape)
plt.imshow(data_original[10].reshape(28,28),cmap='Greys',interpolation='nearest')
sklearn implementation
from sklearn.decomposition import PCA
pca = PCA(n_components=3)
pca.fit(Data)
data_reduced = np.dot(Data, pca.components_.T) # transform
data_original = np.dot(data_reduced, pca.components_) # inverse_transform
plt.imshow(data_original[10].reshape(28,28),cmap='Greys',interpolation='nearest')
I'd like to implement the PCA algorithm using numpy. However, I don't know how to reconstruct the images from the result, and I don't even know if this code is correct.
Actually, when I used sklearn.decomposition.PCA, the result was different from my numpy implementation.
Can you explain the differences?
I can spot a few differences already.
For one:
n=300
projections = only_2.dot(eigenvec[:,:n])
Xhat = np.dot(projections, eigenvec[:,:n].T)
Xhat += mu_D
plt.imshow(Xhat[5].reshape(28,28),cmap='Greys',interpolation='nearest')
The point I'm trying to make is that, if my understanding is correct, with n = 300 you are using 300 eigenvectors, ordered from the highest to the lowest eigenvalue.
But in sklearn
from sklearn.decomposition import PCA
pca = PCA(n_components=1)
pca.fit(only_2)
data_reduced = np.dot(only_2, pca.components_.T) # transform
data_original = np.dot(data_reduced, pca.components_) # inverse_transform
It seems to me that you are fitting just the FIRST component (the one that maximizes variance), not all 300.
Furthermore:
One thing I can clearly say is that you seem to understand what's happening in PCA but you're having trouble implementing it. Correct me if I'm wrong, but:
data_reduced = np.dot(only_2, pca.components_.T) # transform
data_original = np.dot(data_reduced, pca.components_) # inverse_transform
In this part, you are projecting your data onto your eigenvectors, which is what you should be doing in PCA, but in sklearn what you should do is the following:
import numpy as np
from sklearn.decomposition import PCA
pca = PCA(n_components=300)
pca.fit_transform(only_2)
If you could tell me how you created only_2, I can give you a much more specific answer tomorrow.
Here is what sklearn says about fit_transform for PCA: http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA.fit_transform:
fit_transform(X, y=None)
Fit the model with X and apply the dimensionality reduction on X.
Parameters:
X : array-like, shape (n_samples, n_features)
Training data, where n_samples is the number of samples and n_features is the number of features.
y : Ignored
Returns:
X_new : array-like, shape (n_samples, n_components)
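As a sanity check of the difference between the two implementations, note that (with the default whiten=False) sklearn's transform subtracts the fitted mean before projecting, and inverse_transform adds it back. A small sketch with random stand-in data:
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.rand(100, 20)  # hypothetical data standing in for the MNIST matrix

pca = PCA(n_components=3)
Z = pca.fit_transform(X)

# transform is (X - mean_) @ components_.T ...
Z_manual = (X - pca.mean_).dot(pca.components_.T)
print(np.allclose(Z, Z_manual))  # True

# ... and inverse_transform is Z @ components_ + mean_
X_back = pca.inverse_transform(Z)
X_back_manual = Z.dot(pca.components_) + pca.mean_
print(np.allclose(X_back, X_back_manual))  # True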
Imagine I have training data with 9 dimensions and 6000 samples, and I applied PCA using sklearn's PCA.
I reduced its dimensionality to 4, and now I want to convert one new sample with 9 features into my training data's 4-component space as fast as possible.
Here is my first PCA code:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
scaler = StandardScaler()
X_std = scaler.fit_transform(df1)
pca = PCA(n_components=4)
result = pca.fit_transform(X_std)
Is there any way to do this with the sklearn PCA functions?
If you want to transform the original matrix into the reduced-dimensionality projection offered by PCA, you can use the transform function, which runs an efficient inner product between the input matrix and the eigenvectors:
pca = PCA(n_components=4)
pca.fit(X_std)
X_std_reduced = pca.transform(X_std)
From the scikit source:
X_transformed = fast_dot(X, self.components_.T)
So applying the PCA transformation is simply a linear combination -- very fast. Now you can apply the projection to the training set and to any new data that you want to test against in the future.
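For example, a minimal sketch (assuming scaler and pca are the fitted objects from the question, and new_sample is a hypothetical array of 9 raw feature values) for projecting one new observation into the 4-component space:
import numpy as np

# Hypothetical new observation with the same 9 raw features as df1
new_sample = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0])

# Reuse the scaler and PCA fitted on the training data; never refit on new data
new_std = scaler.transform(new_sample.reshape(1, -1))  # shape (1, 9)
new_reduced = pca.transform(new_std)                   # shape (1, 4)
print(new_reduced.shape)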
This article describes the process in more detail: http://www.eggie5.com/69-dimensionality-reduction-using-pca
I've a set of 4k text documents.
They belong to 10 different classes.
I'm trying to see how random forest method performs classification.
The issue is that my feature extraction class extracts 200k features (a combination of words, bigrams, collocations, etc.).
This is highly sparse data, and the random forest implementation in sklearn does not work with sparse data inputs.
Q. What are my options here? Reduce the number of features? How?
Q. Is there any implementation of random forest out there which works with sparse arrays?
My relevant code is as follows:
import logging
import numpy as np
from optparse import OptionParser
import sys
from time import time
#import pylab as pl
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from special_analyzer import *
data_train = load_files(RAW_DATA_SRC_TR)
data_test = load_files(RAW_DATA_SRC_TS)
# split a training set and a test set
y_train, y_test = data_train.target, data_test.target
vectorizer = CountVectorizer(analyzer=SpecialAnalyzer())  # SpecialAnalyzer is my class extracting features from text
X_train = vectorizer.fit_transform(data_train.data)
rf = RandomForestClassifier(max_depth=10, max_features=10)
rf.fit(X_train, y_train)
Several options: take only the 10000 most popular features by passing max_features=10000 to CountVectorizer and convert the result to a dense numpy array with the toarray method:
X_train_array = X_train.toarray()
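A hedged sketch of that first option, reusing the vectorizer setup from the question:
from sklearn.feature_extraction.text import CountVectorizer

# Cap the vocabulary at the 10000 most frequent features so the dense matrix stays manageable
vectorizer = CountVectorizer(analyzer=SpecialAnalyzer(), max_features=10000)
X_train = vectorizer.fit_transform(data_train.data)
X_train_array = X_train.toarray()  # dense array that the random forest can accept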
Otherwise reduce the dimensionality to 100 or 300 dimensions with:
from sklearn.decomposition import TruncatedSVD

pca = TruncatedSVD(n_components=300)
X_reduced_train = pca.fit_transform(X_train)
However, in my experience, I could never make an RF work better than a well-tuned linear model (such as logistic regression with a grid-searched regularization parameter) on the original sparse data (possibly with TF-IDF normalization).
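For reference, a rough sketch of that kind of baseline (TF-IDF features plus a grid-searched logistic regression), reusing data_train, y_train and SpecialAnalyzer from the question; the parameter grid is just an illustrative starting point:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

# TfidfVectorizer produces sparse output and LogisticRegression accepts it directly
pipeline = make_pipeline(
    TfidfVectorizer(analyzer=SpecialAnalyzer()),
    LogisticRegression(max_iter=1000),
)

# Grid-search the regularization strength C, as suggested above
grid = GridSearchCV(pipeline, {"logisticregression__C": [0.01, 0.1, 1, 10, 100]}, cv=5)
grid.fit(data_train.data, y_train)
print(grid.best_params_, grid.best_score_)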
Option 1:
"If the number of variables is very large, forests can be run once with all the variables, then run again using only the most important variables from the first run."
from: http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#giniimp
I'm not sure whether the random forest in sklearn has a feature importance option. The random forest in R implements both mean decrease in Gini impurity and mean decrease in accuracy.
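sklearn's RandomForestClassifier does expose feature_importances_ after fitting, so a hedged sketch of the "run once, keep the most important variables, run again" idea could look like this (it assumes the feature matrix has already been capped, e.g. with max_features, so that densifying it is feasible; the cutoff of 1000 is arbitrary):
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# First pass: fit on all (densified) features and rank them by importance
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train.toarray(), y_train)

# Keep the indices of the 1000 most important features
top = np.argsort(rf.feature_importances_)[::-1][:1000]

# Second pass: refit the forest on the reduced feature set
rf_small = RandomForestClassifier(n_estimators=100, random_state=0)
rf_small.fit(X_train[:, top].toarray(), y_train)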
Option 2:
Do dimensionality reduction. Use PCA or another dimensionality reduction technique to turn the matrix of N dimensions into a smaller matrix, and then use this smaller, less sparse matrix for the classification problem.
Option 3:
Drop correlated features. I believe random forest is supposed to be more robust to correlated features than multinomial logistic regression. That being said, it could be the case that you have a number of correlated features. If you have a lot of pairwise-correlated variables, you can drop one of each pair and you should, in theory, not lose "predictive power". In addition to pairwise correlation there is also multiple correlation. Check out: http://en.wikipedia.org/wiki/Variance_inflation_factor
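A minimal sketch of the pairwise part (assuming X_dense is a dense samples-by-features array of the reduced feature matrix; the 0.95 threshold is arbitrary):
import numpy as np
import pandas as pd

X_df = pd.DataFrame(X_dense)

# Absolute pairwise correlations; keep only the upper triangle to avoid counting pairs twice
corr = X_df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop one feature from every pair whose correlation exceeds the threshold
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
X_pruned = X_df.drop(columns=to_drop)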