How should I interpret the output of pca.components_

How should I interpret the output of pca.components_ - python

I was reading this post Recovering features names of explained_variance_ratio_ in PCA with sklearn and I wanted to understand the output of the following line of code:
pd.DataFrame(pca.components_, columns=subset.columns)
First, I thought that pca components from sklearn would be how much of the variance is explained by each feature (I guess this is the interpretation of PCA, right?). However, I think that this is actually wrong, and the explained variance is given by pca.explained_variance.
Also, the ouput of the dataframe constructed with the script above is very confused to me, because it has several lines and there are also negative numbers.
Furthemore, how does the dataframe constructed above relates to the following plot:
plt.bar(range(pca.explained_variance_), pca.explained_variance_)
I'm really confused about the PCA components and the variance.
If some example is needed, we might build PCA with iris dataset. This is what I've done so far:
subset = iris.iloc[:, 1:5]
scaler = StandardScaler()
pca = PCA()
pipe = make_pipeline(scaler, pca)
pipe.fit(subset)
# Plot the explained variances
features = range(pca.n_components_)
_ = plt.bar(features, pca.explained_variance_)
# Dump components relations with features:
pd.DataFrame(pca.components_, columns=subset.columns)

In PCA, the components (in sklearn, the components_) are linear combinations between the original features, enhancing their variance. So, their are vectors that combine the input features, in order to maximize the variance.
In sklearn, as referenced here, the components_ are presented in order of their explained variance (explained_variance_), from the highest to the lowest value. So, the i-th vector of components_ has the i-th value of explained_variance_.
A useful link on PCA: https://online.stat.psu.edu/stat505/lesson/11

Related

Random Forest Regressor Feature Importance all zero

I'm running a random forest regressor using scikit learn, but all the predictions end up being the same.
I realized that when I fit the data, all the feature importance are zero which is probably why all the predictions are the same.
This is the code that I'm using:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
import pandas as pd
merged_df = pd.read_csv("/home/jovyan/efs/vliu/combined_data.csv")
target = merged_df["400kmDensity"]
merged_df.drop("400kmDensity", axis = 1, inplace = True)
features_list = list(merged_df.columns)
#Set training and testing groups
train_features, test_features, train_target, test_target = train_test_split(merged_df, target, random_state = 16)
#Train model
rf = RandomForestRegressor(n_estimators = 150, random_state = 16)
ran = rf.fit(train_features, train_target)
print("Feature importances: ", rf.feature_importances_)
#Make predictions and calculate error
predictions = ran.predict(test_features)
print("Predictions: ", predictions)
Here's a link to the data file:
https://drive.google.com/file/d/1ECgKAH82wxIvt2OCv4W5ir1te_Vr3N_r/view?usp=sharing
If anybody can see what I did wrong before fitting the data that would result in the feature importances all being zero, that would be much appreciated.

Both your variables "400kmDensity" and "410kmDensity" share a correlation coefficient of >0.99:
np.corrcoef(merged_df["400kmDensity"], merged_df["410kmDensity"])
This practically means that you can predict "400kmDensity" almost exclusively with "410kmDensity". On a scatter plot they form an almost perfect line:
In order to actually explore what affects the values of "400kmDensity", you should exclude "410kmDensity" as a regressor (an explanatory variable). The feature importance can help to identify explanatory variables afterward.
Note that feature importance may not be a perfect metric to determine actual feature importance. Maybe you want to take a look into other available methods like Boruta Algorithm/Permutation Importance/...
In regard to the initial question: I'm not really sure why, but RandomForestRegressor seems to have a problem with your very low target variable(?). I was able to get feature importances after I scaled train_target and train_features in rf.fit(). However, this should not actually be necessary at all in order to apply Random Forest! You maybe want to take a look into the respective documentation or extend your search in this direction. Hope this serves as a hint.
fitted.rf = rf.fit(scale(train_features), scale(train_target))
As mentioned before, the feature importances after this change unsurprisingly look like this:
Also, the column "second" holds only the value zero, which does not explain anything! Your first step should be always EDA (Explanatory Data Analysis) to get a feeling for the data, like checking correlations between columns or generating histograms in order to explore data distributions [...].
There is much more to it, but I hope this gives you a leg-up!

How to plot feature_importance for DecisionTreeClassifier?

I need to plot feature_importances for DecisionTreeClassifier. Features are already found and target results are achieved, but my teacher tells me to plot feature_importances to see weights of contributing factors.
I have no idea how to do it.
model = DecisionTreeClassifier(random_state=12345, max_depth=8,class_weight='balanced')
model.fit(features_train,target_train)
model.feature_importances_
It gives me.
array([0.02927077, 0.3551379 , 0.01647181, ..., 0.03705096, 0. ,
0.01626676])
Why it is not attached to anything like max_depth and just an array of some numbers?

Feature importances represent the affect of the factor to the outcome variable. The greater it is, the more it affects the outcome. That's why you received the array.
For plotting, you can do:
import matplotlib.pyplot as plt
feat_importances = pd.DataFrame(model.feature_importances_, index=features_train.columns, columns=["Importance"])
feat_importances.sort_values(by='Importance', ascending=False, inplace=True)
feat_importances.plot(kind='bar', figsize=(8,6))

Feature importance refers to a class of techniques for assigning scores to input features to a predictive model that indicates the relative importance of each feature when making a prediction.
Feature importance scores can be calculated for problems that involve predicting a numerical value, called regression, and those problems that involve predicting a class label, called classification.
Load the feature importances into a pandas series indexed by your dataframe column names, then use its plot method.
From Scikit Learn
Feature importances are provided by the fitted attribute feature_importances_ and they are computed as the mean and standard deviation of accumulation of the impurity decrease within each tree.
How are feature_importances in RandomForestClassifier determined?
For your example:
feat_importances = pd.Series(model.feature_importances_, index=df.columns)
feat_importances.nlargest(5).plot(kind='barh')
More ways to plot Feature Importances- Random Forest Feature Importance Chart using Python

Why use LSA before K-Means when doing text clustering

I'm following this tutorial from Scikit learn on text clustering using K-Means:
http://scikit-learn.org/stable/auto_examples/text/document_clustering.html
In the example, optionally LSA (using SVD) is used to perform dimensionality reduction.
Why is this useful ? The number of dimensions (features) can already be controlled in the TF-IDF vectorizer using the "max_features" parameter.
I understand that LSA (and LDA) are also topic modelling techniques. The difference with clustering is that documents belong to multiple topics, but only to one cluster. I do not understand why LSA would be used in the context of K-Means clustering.
Example code:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
documents = ["some text", "some other text", "more text"]
tfidf_vectorizer = TfidfVectorizer(max_df=0.5, max_features=10000, min_df=2, stop_words='english', use_idf=True)
X = tfidf_vectorizer.fit_transform(documents)
svd = TruncatedSVD(1000)
normalizer = Normalizer(copy=False)
lsa = make_pipeline(svd, normalizer)
Xnew = lsa.fit_transform(X)
model = KMeans(n_clusters=10, init='k-means++', max_iter=100, n_init=1, verbose=False)
model.fit(Xnew)

LSA transforms the bag-of-words feature space to a new feature-space (with ortho-normal set of basis vectors) where each dimension represents a latent concept (represented as a linear combination of words in the original dimension).
As with PCA, a few top eigenvectors generally capture most of the variance in the transformed feature space and the other eigenvectors mainly represent the noise in the dataset, hence, the top eigenvectors in the LSA featurespace can be thought of to be likely to capture most of the concepts defined by the words in the original space.
Hence, reduction in dimension in the transofrmed LSA feature space is likely to be much more effective than in the original BOW tf-idf feature space (which simply chops off the less frequent / unimportant words), thereby leading to better quality data after the dimensionality reduction and likely to improve the quality of the clusters.
Additionally, dimension reduction helps to fight the curse of dimensionality problem (e.g., that arises while the distance computation in k-means).

There is a paper that shows that PCA eigenvectors are good initializers for K-Means.
Controlling the dimension with the max_features parameter is equivalent to cutting off the vocabulary size which has negative effects. For example if you set max_features to 10 the model will work with the most common 10 words in the corpus and ignore the rest.

Add new vector to PCA new space data python

Imagine I have training data with 9 dimension and 6000 sample, and I applied PCA algorithm using sklearn PCA.
I reduce it's dimensions to 4, and know I want convert one new sample with 9 features to my training data space with 4 components as fast as possible.
here is my first pca code:
X_std = StandardScaler().fit_transform(df1)
pca = PCA(n_components = 4)
result = pca.fit_transform(X_std)
Is there any way do this with sklearn pca function?

If you want to transform the original matrix to the reduced dimensionality projection offered by PCA you can use the transform function which will run an efficient inner-product on the eigenvectors and the input matrix:
pca = PCA(n_components=4)
pca.fit(X_train)
X_std_reducted = pca.transform(X_std)
From the scikit source:
X_transformed = fast_dot(X, self.components_.T)
So applying the PCA transformation is simply a linear combination -- very fast. Now you can apply the projection to the training set and any new data that we want to tests against in the future.
This article describes the process in more detail: http://www.eggie5.com/69-dimensionality-reduction-using-pca

Input format for logistic regression in scikit-learn as in R

When Using logistic regression in R, the data input for the 'glm' function (family = binomial) can be: (?family) in several formats, and specifically in the format of:
......
For the binomial and quasibinomial families the response can be
specified in one of three ways:
......
As a numerical vector with values between 0 and 1, interpreted as the
proportion of successful cases (with the total number of cases given
by the weights)....
I have aggregated data that represents proportion of success out of trials (number between 0 and 1) and their equivalent weights, I'm interested in applying logistic regression with it, which would be trivial to use in R.
Unfortunately i cant use R in this project, and would like to use scikit-learn to estimate the logistic regression coefficients . More precise, i'm looking to apply the sklearn.linear_model.LogisticRegression in a form of input that will allow me to insert the model proportions and wights, in a similar fashion as available in R.
example:
from sklearn import linear_model
import pandas as pd
df = pd.DataFrame([[1,1,1,0], [1,1,1,0],[1,1,1,1],[2,2,1,1] , [2,2,1,1],[2,2,1,0] , [3,3,1,0] ],columns=['a', 'b','Trials','Success'])
logistic = linear_model.LogisticRegression()
#this works
logistic.fit(X=df[['a','b','Trials']] , y=df.Success)
logistic.predict_proba(df[['a','b','Trials']])
prob_to_success = logistic.predict_proba(df[['a','b','Trials']])[:,1]
prob_to_success
Out[51]: array([ 0.45535843, 0.45535843, 0.45535843, 0.42212169, 0.42212169,
0.42212169, 0.38957565])
#How can i use the following Data?
df_agg = df.groupby(['a','b'] , as_index=False)['Trials','Success'].sum()
df_agg["Prop"] = df_agg.Success / (df_agg.Trials)
df_agg
#I want to use Prop & Trials as weights in df_agg
Thanks in advance!

Convert to log-odds form and use linear regression on the transformation. Sklearn doesn't seem to have a quasi-binomial conversion for logistic regression. As you said, trivial in R but sklearn seems to not have anything of the sort.

If you want to use weights, you can use them in the fit function of LogisticRegression:
fit(X, y, sample_weight=None)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.