Can I standardize my PCA-applied count vector? - python

I have applied CountVectorizer() on my X_train and it returned a sparse matrix.
Usually, if we want to standardize a sparse matrix, we pass the with_mean=False parameter:
scaler = StandardScaler(with_mean=False)
X_train = scaler.fit_transform(X_train)
But in my case, after applying CountVectorizer on my X_train I also performed PCA (TruncatedSVD) to reduce dimensions, so my data is no longer a sparse matrix.
So can I now apply StandardScaler() directly, without passing with_mean=False (i.e. with with_mean=True)?

If you look at what the with_mean parameter does, you'll find that it simply centers your data before scaling. The reason you don't center a sparse matrix is that centering would turn it into a dense matrix, which destroys the sparsity and takes up far more memory.
After you perform PCA your data has reduced dimensions and can now be centered before scaling. So yes, you can apply StandardScaler() directly.
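For completeness, a minimal sketch of that flow (the toy documents and variable names here are illustrative, not taken from the question):
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import StandardScaler

X_train_text = ["first document", "second document", "another third one"]

vectorizer = CountVectorizer()
X_counts = vectorizer.fit_transform(X_train_text)  # sparse count matrix

svd = TruncatedSVD(n_components=2)
X_reduced = svd.fit_transform(X_counts)            # dense numpy array now

scaler = StandardScaler()                          # with_mean=True by default
X_scaled = scaler.fit_transform(X_reduced)         # centering is fine on dense data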

Related

How should I interpret the output of pca.components_

I was reading this post Recovering features names of explained_variance_ratio_ in PCA with sklearn and I wanted to understand the output of the following line of code:
pd.DataFrame(pca.components_, columns=subset.columns)
First, I thought that the PCA components from sklearn would represent how much of the variance is explained by each feature (I guess this is the interpretation of PCA, right?). However, I think this is actually wrong, and the explained variance is given by pca.explained_variance_.
Also, the output of the dataframe constructed with the script above is very confusing to me, because it has several rows and also contains negative numbers.
Furthermore, how does the dataframe constructed above relate to the following plot:
plt.bar(range(len(pca.explained_variance_)), pca.explained_variance_)
I'm really confused about the PCA components and the variance.
If some example is needed, we might build PCA with iris dataset. This is what I've done so far:
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

subset = iris.iloc[:, 1:5]  # iris loaded as a DataFrame; the four numeric measurements
scaler = StandardScaler()
pca = PCA()
pipe = make_pipeline(scaler, pca)
pipe.fit(subset)
# Plot the explained variances
features = range(pca.n_components_)
_ = plt.bar(features, pca.explained_variance_)
# Dump the components' relations with the features:
pd.DataFrame(pca.components_, columns=subset.columns)
In PCA, the components (in sklearn, the components_ attribute) are linear combinations of the original features, constructed to maximize variance. So they are vectors that combine the input features in order to maximize the variance of the projected data.
In sklearn, as referenced here, the components_ are sorted by their explained variance (explained_variance_), from highest to lowest. So the i-th vector of components_ corresponds to the i-th value of explained_variance_.
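A short sketch of that ordering (assuming a recent scikit-learn where load_iris supports as_frame=True); the negative entries in the dataframe simply mean a feature enters that component with a negative weight:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = load_iris(as_frame=True)
subset = iris.data                           # the four numeric measurements

pca = PCA()
pca.fit(StandardScaler().fit_transform(subset))

# Rows = components (sorted by explained variance), columns = original features.
print(pd.DataFrame(pca.components_, columns=subset.columns))
print(pca.explained_variance_)               # decreasing; the i-th value belongs to row i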
A useful link on PCA: https://online.stat.psu.edu/stat505/lesson/11

Does sparse matrix work with MultinomialNB?

I have BoW vectors of shape (100000, 56000) and I want to use MultinomialNB from scikit-learn for a classification task.
Does MultinomialNB take a sparse matrix for fitting the data?
I can't convert it into a dense matrix with toarray() due to a memory error. If the NB classifier doesn't take a sparse matrix, are there any alternatives I could use to fit the data without converting it into a dense matrix?
From the documentation of MultinomialNB.fit (emphasis added):
fit(X, y, sample_weight=None)
Parameters:
X : {array-like, sparse matrix}, shape = [n_samples, n_features]
Training vectors, where n_samples is the number of samples and n_features is the number of features.
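So yes, MultinomialNB accepts a sparse matrix directly. A minimal sketch with made-up shapes (not the (100000, 56000) matrix from the question):
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.naive_bayes import MultinomialNB

X = sparse_random(1000, 5600, density=0.001, format="csr", random_state=0)  # stand-in sparse BoW counts
y = np.random.default_rng(0).integers(0, 2, size=1000)                      # stand-in binary labels

clf = MultinomialNB()
clf.fit(X, y)                  # no .toarray() needed; the CSR matrix is used as-is
print(clf.predict(X[:5]))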

Sklearn, inconsistent numbers of samples using sparse matrices

I'm using the sklearn SVC implementation for multiclass SVM.
My model is supposed to have multiple outputs, so I'm one-hot encoding my labels (MultiLabelBinarizer).
mlb = MultiLabelBinarizer(classes=classes, sparse_output=True)
mlb.fit(y_train)
y_train = mlb.transform(y_train)
This gives me a vector of labels per sample, y_train is a csr_matrix of shape (n_samples, n_classes) specifically (18171, 17).
My training set is a scipy csc_matrix of shape (n_samples, n_features), specifically (18171, 1001).
m_o_SVC = MultiOutputClassifier(SVC(C=0.1, kernel='linear', probability=True), n_jobs=-1)
m_o_SVC.fit(X_train, y_train)
This trains several classifiers each with a slice of the labels.
But I get this Warning:
"DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel()."
And this error:
"Found input variables with inconsistent numbers of samples: [18171, 1]"
If I don't use a sparse matrix for the labels everything works, but I am not sure whether a dense label representation will force the algorithm to work with dense matrices internally (with a loss of performance).
Also, I don't understand the problem since the shapes are consistent.
Is this a problem with sklearn?

Add new vector to PCA new space data python

Imagine I have training data with 9 dimensions and 6000 samples, and I applied PCA using sklearn's PCA.
I reduced its dimensionality to 4, and now I want to project one new sample with 9 features into my training data's 4-component space as fast as possible.
Here is my initial PCA code:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(df1)
pca = PCA(n_components=4)
result = pca.fit_transform(X_std)
Is there any way do this with sklearn pca function?
If you want to transform the original matrix into the reduced-dimensionality projection offered by PCA, you can use the transform method, which runs an efficient inner product between the input matrix and the eigenvectors:
pca = PCA(n_components=4)
pca.fit(X_std)
X_std_reduced = pca.transform(X_std)
From the scikit source:
X_transformed = fast_dot(X, self.components_.T)
So applying the PCA transformation is simply a linear combination -- very fast. You can now apply the projection to the training set and to any new data that you want to test against in the future.
This article describes the process in more detail: http://www.eggie5.com/69-dimensionality-reduction-using-pca
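Putting that together for the question's shapes, a rough sketch (reusing the asker's df1; note that a new sample has to go through the same fitted StandardScaler before pca.transform):
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_std = scaler.fit_transform(df1)            # df1: the (6000, 9) training data
pca = PCA(n_components=4)
result = pca.fit_transform(X_std)            # shape (6000, 4)

new_sample = np.zeros((1, 9))                # placeholder for one new 9-feature row
new_reduced = pca.transform(scaler.transform(new_sample))   # shape (1, 4)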

scikit learn PCA dimension reduction - data lot of features and few samples

I am trying to do dimensionality reduction using PCA from scikit-learn. My data set has around 300 samples and 4096 features. I want to reduce the dimensions to 400 and to 40, but when I call the algorithm the resulting data has at most "number of samples" features.
from sklearn.decomposition import PCA
pca = PCA(n_components = 400)
trainData = pca.fit_transform(trainData)
testData = pca.transform(testData)
Where the initial shape of trainData is 300x4096 and the resulting data shape is 300x300. Is there any way to perform this operation on this kind of data (many features, few samples)?
The maximum number of principal components that can be extracted from an M x N dataset is min(M, N). It's not an algorithm issue; fundamentally, that is the maximum number of components there are.
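A quick sketch with random data of the question's shape, showing that 40 components work while the overall cap is min(300, 4096) = 300:
import numpy as np
from sklearn.decomposition import PCA

trainData = np.random.rand(300, 4096)        # stand-in for the real data

pca = PCA(n_components=40)                   # 40 <= min(300, 4096), so this works
reduced = pca.fit_transform(trainData)
print(reduced.shape)                         # (300, 40)

print(PCA().fit(trainData).n_components_)    # defaults to min(n_samples, n_features) = 300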
