I'm very new to PCA.
I have 10 X variables for my model. These are the X variable labels:
x = ['Day', 'Month', 'Year', 'Rolling Average', 'Holiday Effect', 'Day of the Week', 'Week of the Year', 'Weekend Effect', 'Last Day of the Month', 'Quarter']
This is the explained variance I plotted, with the x-axis being the principal component index:
[ 3.47567089e-01 1.72406623e-01 1.68663799e-01 8.86739892e-02
4.06427375e-02 2.75054035e-02 2.26578769e-02 5.72892368e-03
2.49272688e-03 6.37160140e-05]
I need to know whether I have a good selection of features, and how I can tell which feature contributes the most.
from sklearn import decomposition
pca = decomposition.PCA()
pca.fit(X_norm)
scores = pca.explained_variance_  # per-component variance; pca.explained_variance_ratio_ gives the proportions
Though I do not know the dataset, I recommend scaling your features before using PCA, since PCA finds the directions of maximal variance and features with large ranges would otherwise dominate. I assume X_norm in your code refers to the scaled data.
With PCA, the goal is to reduce dimensionality. To do that, we start from a feature space that includes all of your X variables, and we end up with a projection of that space, which is typically a different feature (sub)space.
In practice, when features are correlated, PCA can exploit that correlation to represent the data in fewer dimensions.
Think of it this way: if I'm holding a sheet of paper full of dots on my desk, do I need a third dimension to represent that dataset? Probably not, since all the dots lie on the paper and can be represented in 2D.
When you are trying to decide how many principal components to keep from your new feature space, look at the explained variance: it tells you how much of the information each principal component carries.
When I look at the principal components in your data, I see that ~85% of the variance can be attributed to the first 6 principal components.
You can also set n_components. For example if you use n_components=2, then your transformed dataset will have 2 features.
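To answer the second part of your question, you can inspect the loadings in pca.components_ to see which original features contribute most to each principal component. A minimal sketch, assuming X_norm is your scaled data and its columns are in the same order as your label list x:

import numpy as np
import pandas as pd
from sklearn import decomposition

pca = decomposition.PCA()
pca.fit(X_norm)

# Rows = principal components, columns = original features
loadings = pd.DataFrame(pca.components_, columns=x,
                        index=['PC%d' % (i + 1) for i in range(pca.components_.shape[0])])

# Features with the largest absolute loading on PC1 contribute most to it
print(loadings.loc['PC1'].abs().sort_values(ascending=False))

# Weighting the absolute loadings by explained variance ratio gives a rough overall ranking
overall = np.abs(pca.components_).T @ pca.explained_variance_ratio_
print(pd.Series(overall, index=x).sort_values(ascending=False))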
I am unsure if this kind of question (related to PCA) is acceptable here or not.
However, it is commonly suggested to mean-center the data before PCA. In my case, I have 2 different classes (each class has different participants), and my aim is to distinguish and classify those 2 classes. Still, I am not sure whether mean-centering should be applied to the whole dataset or to each class separately.
Is it better to do it separately for each class (and if so, should the other preprocessing steps also be done separately), or does that not make sense?
PCA is just a rotation, optionally accompanied by a projection onto a lower-dimensional space. It finds axes of maximal variance (which happen to be the principal axes of inertia of your point cloud) and then rotates the dataset to align those axes with your coordinate system. You get to decide how many such axes you'd like to retain, which means the rotation is then followed by projection onto the first k axes of greatest variance, with k the dimensionality of the representation space you'll have chosen.
With this in mind, again like for calculating axes of inertia, you could decide to look for such axes through the center of mass of your cloud (the mean), or through any arbitrary origin of choice. In the former case, you would mean-center your data, and in the latter you may translate the data to any arbitrary point, with the result being to diminish the importance of the intrinsic cloud shape itself and increase the importance of the distance between the center of mass and the arbitrary point. Thus, in practice, you would almost always center your data.
You may also want to standardize your data (center and divide by standard deviation so as to make variance 1 on each coordinate), or even whiten your data.
In any case, you will want to apply the same transformations to the entire dataset, not class by class. If you were to apply the transformation class by class, whatever distance exists between the centers of gravity of each would be reduced to 0, and you would likely observe a collapsed representation with the two classes as overlapping. This may be interesting if you want to observe the intrinsic shape of each class, but then you would also apply PCA separately for each class.
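As a minimal sketch of that advice (assuming a feature matrix X that contains samples from both classes; names are illustrative):

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Fit the scaler and the PCA once, on the whole (training) dataset,
# regardless of class membership
scaler = StandardScaler()            # centers and scales every feature
X_std = scaler.fit_transform(X)      # X holds samples from both classes

pca = PCA(n_components=2)
X_proj = pca.fit_transform(X_std)    # both classes now live in the same 2D space

# New/test samples are transformed with the SAME fitted objects, e.g.:
# X_test_proj = pca.transform(scaler.transform(X_test))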
Please note that PCA may make it easier for you to visualize the two classes (without guarantees, if the data are truly n-dimensional without much of a lower-dimensional embedding). But in no circumstances would it make it easier to discriminate between the two. If anything, PCA will reduce how discriminable your classes are, and it is often the case that the projection will intermingle classes (increase ambiguity) that are otherwise quite distinct and e.g. separable with a simple hyper-surface.
PCA is, more or less by definition, an SVD applied to mean-centered data.
Depending on the implementation (if you use a PCA from a library), the centering is applied automatically, e.g. in sklearn, because, as said, the data has to be centered by definition.
So for sklearn you do not need this preprocessing step, and in general you apply it over your whole data.
PCA is unsupervised and can be used to find a representation that is more meaningful and representative for your classes afterwards. So you need all your samples in the same feature space, i.e. transformed by the same PCA.
In short: you do the PCA once, over your whole (training) data, and the centering must be done over your whole (training) data. Libraries like sklearn do the centering automatically.
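If you want to convince yourself of the PCA/centered-SVD equivalence, here is a small sketch on synthetic data (names are illustrative):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))

# PCA via sklearn (centering happens internally)
pca = PCA().fit(X)

# The same thing via SVD of the explicitly mean-centered data
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

print(np.allclose(np.abs(pca.components_), np.abs(Vt)))           # True (components match up to sign)
print(np.allclose(pca.explained_variance_, S**2 / (len(X) - 1)))  # True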
A k-nearest-neighbors classifier will help you distinguish between the two classes. Also try t-SNE to visualize the classes when the data are high-dimensional.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

def pca_classifier(X, y, n_components=2, n_neighbors=1):
    """
    X: numpy array of shape (n_samples, n_features)
    y: numpy array of shape (n_samples, )
    n_components: int, number of components to keep
    n_neighbors: int, number of neighbors to use in the knn classifier
    """
    # 1. PCA: project the data onto the first n_components principal components
    pca = PCA(n_components=n_components)
    X_pca = pca.fit_transform(X)
    # 2. KNN: fit a k-nearest-neighbors classifier in the reduced space
    knn = KNeighborsClassifier(n_neighbors=n_neighbors)
    knn.fit(X_pca, y)
    # 3. Plot the first two principal components, colored by class label
    plt.figure(figsize=(8, 6))
    plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap=plt.cm.Set1, edgecolor='k')
    plt.xlabel('PC1')
    plt.ylabel('PC2')
    plt.title('PCA')
    plt.show()
    return knn
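A hypothetical usage sketch on synthetic data (note that the function above returns only the fitted KNN; to project new samples you would also need the fitted PCA, so you may want to return pca as well):

from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           n_classes=2, random_state=0)
knn = pca_classifier(X, y, n_components=2, n_neighbors=3)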
So I had to create a linear regression in Python, but this dataset has over 800 columns. Is there any way to see which columns are contributing most to the linear regression model? Thank you.
Look at the coefficients for each of the features, as sketched below. This comparison is meaningful when the features are on comparable scales (e.g. standardized). Ignore the sign of the coefficient:
A large absolute value means the feature is heavily contributing.
A value close to zero means the feature is not contributing much.
A value of zero means the feature is not contributing at all.
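A minimal sketch of that idea (assuming a hypothetical DataFrame df with target column 'Y'; features are standardized so the coefficient magnitudes are comparable):

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

X = df.drop(columns='Y')
y = df['Y']

# Standardize so that coefficient magnitudes are on a comparable scale
X_std = StandardScaler().fit_transform(X)

model = LinearRegression().fit(X_std, y)

# Rank features by absolute coefficient
coef = pd.Series(np.abs(model.coef_), index=X.columns)
print(coef.sort_values(ascending=False).head(20))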
You can measure the correlation between each independent variable and dependent variable, for example:
corr(X1, Y)
corr(X2, Y)
.
.
.
corr(Xn, Y)
and you can test the model by selecting the N most correlated variables (see the sketch below).
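For example, with a pandas DataFrame df and a target column 'Y' (both hypothetical names), you could rank features by absolute correlation and keep the top N:

import pandas as pd

N = 20  # number of features to keep
corr_with_target = df.drop(columns='Y').corrwith(df['Y']).abs()
print(corr_with_target.sort_values(ascending=False).head(N))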
There are more sophisticated methods to perform dimensionality reduction:
PCA (Principal Component Analysis)
(https://towardsdatascience.com/a-one-stop-shop-for-principal-component-analysis-5582fb7e0a9c)
Forward Feature Construction
Use XGBoost in order to measure feature importance for each variable and then select the N most important variables (a sketch follows this list)
(How to get feature importance in xgboost?)
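A rough sketch of the XGBoost route, assuming the sklearn-style wrapper, a regression target y, and a DataFrame X (so that column names are available):

import pandas as pd
from xgboost import XGBRegressor

model = XGBRegressor(n_estimators=200, random_state=0)
model.fit(X, y)

# Rank variables by the model's importance scores and keep the top N
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(20))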
There are many ways to perform this action and each one has pros and cons.
https://machinelearningmastery.com/dimensionality-reduction-algorithms-with-python/
If you are just looking for variables with high correlation, I would just do something like this:
import pandas as pd

cols = df.columns
for c in cols:
    if c == 'Y':
        continue  # skip the target column itself
    # Set this threshold to whatever you would like; abs() also catches strong negative correlations
    if abs(df['Y'].corr(df[c])) > .7:
        print(c, df['Y'].corr(df[c]))
After you have decided what threshold/columns you want, you can append c to a list instead of printing it.
I am trying to familiarize myself with Doc2Vec results by using a public dataset of movie reviews. I have cleaned the data and run the model. There are, as you can see below, 6 tags/genres. Each is a document with its vector representation.
doc_tags = list(doc2vec_model.docvecs.doctags.keys())
print(doc_tags)
X = doc2vec_model[doc_tags]
print(X)
['animation', 'fantasy', 'comedy', 'action', 'romance', 'sci-fi']
[[ -0.6630892 0.20754902 0.2949621 0.622197 0.15592825]
[ -1.0809666 0.64607996 0.3626246 0.9261689 0.31883526]
[ -2.3482993 2.410015 0.86162883 3.0468733 -0.3903969 ]
[ -1.7452248 0.25237766 0.6007084 2.2371168 0.9400951 ]
[ -1.9570891 1.3037877 -0.24805197 1.6109428 -0.3572465 ]
[-15.548988 -4.129228 3.608777 -0.10240117 3.2107658 ]]
print(doc2vec_model.docvecs.most_similar('romance'))
[('comedy', 0.6839742660522461), ('animation', 0.6497607827186584), ('fantasy', 0.5627620220184326), ('sci-fi', 0.14199887216091156), ('action', 0.046558648347854614)]
"Romance" and "comedy" are fairly similar, while "action" and "sci-fi" are fairly dissimilar genres compared to "romance". So far so good. However, in order to visualize the results, I need to reduce the vector dimensionality. Therefore, I try first t-SNE and then PCA. This is the code and the results:
# TSNE
from sklearn.manifold import TSNE
import pandas as pd

tsne = TSNE(n_components=2)
X_tsne = tsne.fit_transform(X)
df = pd.DataFrame(X_tsne, index=doc_tags, columns=['x', 'y'])
print(df)
x y
animation -162.499695 74.153679
fantasy -10.496888 93.687149
comedy -38.886723 -56.914558
action -76.036247 232.218231
romance 101.005371 198.827988
sci-fi 123.960182 20.141081
# PCA
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
df_1 = pd.DataFrame(X_pca, index=doc_tags, columns=['x', 'y'])
print(df_1)
x y
animation -3.060287 -1.474442
fantasy -2.815175 -0.888522
comedy -2.520171 2.244404
action -2.063809 -0.191137
romance -2.578774 0.370727
sci-fi 13.038214 -0.061030
There is something wrong. This is even more visible when I visualize the results:
[t-SNE and PCA scatter plots of the 6 genre vectors omitted]
This is clearly not what the model has produced. I am sure I am missing something basic. If you have any suggestions, it would be greatly appreciated.
First, you're always going to lose some qualities of the full-dimensionality model when doing a 2D projection, as required for such visualizations. You just hope – & try to choose appropriate methods/parameters – that the important aspects are preserved. So there isn't necessarily anything 'wrong' when a particular visualization disappoints.
And especially with high-dimensional 'dense embeddings' like with word2vec/doc2vec, there's way more info in the full embedding than can be shown in the 2D projection. You may see some sensible micro-relationships in such a plot – close neighbors in a few places matching expectations – but the overall 'map' won't be nearly as interpretable as, well, a real map of a truly 2D surface.
But also: it looks like you're training a Doc2Vec model with only 6 document-tags (and, judging by the printed vectors, only 5 dimensions). Because of the way Doc2Vec works, if there are only 6 unique tags, it's essentially the case that you're training on only 6 virtual documents, just chopped up into different fragments. It's as if you took all the 'comedy' reviews and concatenated them into one big doc, and the same with all the 'romance' reviews, etc.
For many uses of Doc2Vec, and particularly in the published papers that introduced the underlying 'Paragraph Vector' algorithm, it is more typical to use each document's unique ID as its 'tag', especially since many downstream uses then need a doc-vector per document, rather than per-known-category. This may better preserve/model information in the original data - whereas collapsing everything to just 6 mega-documents, and 6 summary tag-vectors, imposes more simple implied category-shapes.
Note that if using unique IDs as tags, you won't automatically wind up with one summary tag-vector per category that you can read from the model. But, you could synthesize such a vector, perhaps by simply averaging the vectors of all the docs in a certain category to get that category's centroid.
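For instance, if you had trained with one unique ID per review and kept a mapping from each ID to its genre, a per-genre centroid could be synthesized roughly like this (names are hypothetical; uses the same older gensim docvecs API as your code):

import numpy as np

# doc_id_to_genre: dict mapping each training document's unique tag to its genre
genre_vecs = {}
for doc_id, genre in doc_id_to_genre.items():
    genre_vecs.setdefault(genre, []).append(doc2vec_model.docvecs[doc_id])

# Average the per-document vectors to get one centroid vector per genre
genre_centroids = {g: np.mean(vecs, axis=0) for g, vecs in genre_vecs.items()}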
It's still sometimes valuable to use known-labels as document tags, either instead-of unique IDs (as you've done here), or in-addition-to unique IDs (using the option of more-than-one tag per training document).
But you should know using known-labels, and only known-labels, as tags can be limiting. (For example, if you instead trained a separate vector per document, you could then plot the docs via your visualization, and color the dots with known-labels, and see which categories tend to have large overlaps, and highlight certain datapoints that seem to challenge the categories, or have nearest-neighbors in a different category.)
t-SNE embedding: it is a common mistake to think that distances between points (or clusters) in the embedded space are proportional to the distances in the original space. This is a major drawback of t-SNE; for more information see here. Therefore you shouldn't draw any conclusions about distances from the visualization.
PCA embedding: PCA corresponds to a rotation of the coordinate system into a new orthogonal coordinate system which optimally describes the variance of the data. When keeping all principal components, the (Euclidean) distances are preserved; however, when reducing the dimension (e.g. to 2D), the points are projected onto the axes with the most variance and the distances might no longer correspond to the original distances. Again, it is difficult to draw conclusions about the distances of the points in the original space from the embedding.
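You can check the distance-preservation claim directly on synthetic data: keeping all components leaves pairwise Euclidean distances unchanged, while a 2D projection only approximates them. A small sketch (illustrative data):

import numpy as np
from scipy.spatial.distance import pdist
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))

d_orig = pdist(X)                                      # distances in the original space
d_full = pdist(PCA(n_components=5).fit_transform(X))   # all components kept
d_2d = pdist(PCA(n_components=2).fit_transform(X))     # reduced to 2D

print(np.allclose(d_orig, d_full))   # True: the rotation preserves distances
print(np.abs(d_orig - d_2d).max())   # > 0: the projection distorts distances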
I have a regression model where my target variable (days) takes quantitative values ranging from 2 to 30. My RMSE is 2.5, and all the X variables are nominal (categorical), so I have dummy-encoded them.
I want to know what would be a good value of RMSE. I want to get it within 1-1.5 or even lower, but I am unaware of what I should do to achieve that.
Note: I have already tried feature selection and removing features with low importance.
Any ideas would be appreciated.
If your x values are categorical, then it does not necessarily make much sense to bind them to a uniform grid. Who's to say categories A and B should be spaced apart the same as B and C? Assuming that they are will only lead to an incorrect representation of your results.
As your choice of scale is the unknown, you would do better, in terms of visualisation, to set your uniform x grid as the day number and then see where the categories would fall on the y scale if given a linear relationship.
RMS Error doesn't come into it at all if you don't have quantitative data for x and y.
Question
I implemented a K-Means algorithm in Python. First I apply PCA and whitening to the input data. Then I use k-means to successfully extract k centroids from the data.
How can I use those centroids to understand the "features" learnt? Are the centroids already the features (doesn't seem like this to me) or do I need to combine them with the input data again?
In response to some answers: k-means is not "just" a method for clustering; it is a vector quantization method. That is, the goal of k-means is to describe a dataset with a reduced number of feature vectors. Therefore there are strong analogies to methods like sparse filtering/learning regarding the potential outcome.
Code Example
from scipy.cluster.vq import kmeans, vq

# Perform k-means, data already pre-processed (PCA + whitening)
centroids, _ = kmeans(matrix_pca_whitened, 1000)   # kmeans returns (codebook, distortion)
# Assign each data vector to its nearest centroid
idx, _ = vq(song_matrix_pca, centroids)
The clusters produced by the k-means algorithm separate your input space into K regions. When you have new data, you can tell which region it belongs to, and thus classify it.
The centroids are just a property of these clusters.
You can have a look at the scikit-learn doc if you are unsure, and at the map to make sure you choose the right algorithm.
This is sort of a circular question: "understanding" requires knowing something about the features outside of the k-means process. All that k-means does is identify k groups of physical proximity. It says "there are clumps of stuff in these k places, and here's how all the points map to the nearest clump."
What this means in terms of the features is up to the data scientist, rather than any deeper meaning that k-means can ascribe. The variance of each group may tell you a little about how tightly those points are clustered. Do remember that k-means also chooses starting points at random; an unfortunate choice can easily give a sub-optimal description of the space.
A centroid is basically the "mean" of the cluster. If you can ascribe some deeper understanding from the distribution of centroids, great -- but that depends on the data and features, rather than any significant meaning devolving from k-means.
Is that the level of answer you need?
The centroids are in fact the features learnt. Since k-means is a method of vector quantization, we look up which cluster each observation belongs to, and that observation is therefore best described by that cluster's feature vector (centroid).
If an observation was, for example, split into 10 patches beforehand, that observation can consist of at most 10 feature vectors.
Example:
Method: K-means with k=10
Dataset: 20 observations divided into 2 patches each = 40 data vectors
We now perform k-means on this patched dataset and get the nearest centroid per patch. We could then create a vector of length 10 (= k) for each of the 20 observations, and if patch 1 belongs to centroid 5 and patch 2 to centroid 9, the vector could look like: 0 - 0 - 0 - 0 - 1 - 0 - 0 - 0 - 1 - 0.
This means that this observation consists of the centroids/features 5 and 9. You could also use the distance between patch and centroid instead of this hard assignment.
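A small sketch of that hard-assignment encoding (hypothetical data: 40 patch vectors belonging to 20 observations, k = 10):

import numpy as np
from scipy.cluster.vq import kmeans, vq

k = 10
patches = np.random.default_rng(0).normal(size=(40, 16))   # 20 observations x 2 patches each
patch_owner = np.repeat(np.arange(20), 2)                  # which observation each patch belongs to

centroids, _ = kmeans(patches, k)
assignments, _ = vq(patches, centroids)                     # nearest centroid per patch

# Bag-of-centroids encoding: one length-k vector per observation
encoding = np.zeros((20, k))
for obs, centroid_idx in zip(patch_owner, assignments):
    encoding[obs, centroid_idx] = 1   # hard assignment; distances could be used instead
print(encoding[0])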