Gensim Doc2Vec visualization issue when using t-SNE and/or PCA - python

I am trying to familiarize with Doc2Vec results by using a public dataset of movie reviews. I have cleaned the data and run the model. There are, as you can see below, 6 tags/genres. Each is a document with its vector representation.
doc_tags = list(doc2vec_model.docvecs.doctags.keys())
print(doc_tags)
X = doc2vec_model[doc_tags]
print(X)
['animation', 'fantasy', 'comedy', 'action', 'romance', 'sci-fi']
[[ -0.6630892 0.20754902 0.2949621 0.622197 0.15592825]
[ -1.0809666 0.64607996 0.3626246 0.9261689 0.31883526]
[ -2.3482993 2.410015 0.86162883 3.0468733 -0.3903969 ]
[ -1.7452248 0.25237766 0.6007084 2.2371168 0.9400951 ]
[ -1.9570891 1.3037877 -0.24805197 1.6109428 -0.3572465 ]
[-15.548988 -4.129228 3.608777 -0.10240117 3.2107658 ]]
print(doc2vec_model.docvecs.most_similar('romance'))
[('comedy', 0.6839742660522461), ('animation', 0.6497607827186584), ('fantasy', 0.5627620220184326), ('sci-fi', 0.14199887216091156), ('action', 0.046558648347854614)]
"Romance" and "comedy" are fairly similar, while "action" and "sci-fi" are fairly dissimilar genres compared to "romance". So far so good. However, in order to visualize the results, I need to reduce the vector dimensionality. Therefore, I try first t-SNE and then PCA. This is the code and the results:
# TSNE
from sklearn.manifold import TSNE
import pandas as pd

# note: with only 6 points, the default perplexity (30) is larger than n_samples;
# recent scikit-learn versions require perplexity < n_samples
tsne = TSNE(n_components=2)
X_tsne = tsne.fit_transform(X)
df = pd.DataFrame(X_tsne, index=doc_tags, columns=['x', 'y'])
print(df)
x y
animation -162.499695 74.153679
fantasy -10.496888 93.687149
comedy -38.886723 -56.914558
action -76.036247 232.218231
romance 101.005371 198.827988
sci-fi 123.960182 20.141081
# PCA
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
df_1 = pd.DataFrame(X_pca, index=doc_tags, columns=['x', 'y'])
print(df_1)
x y
animation -3.060287 -1.474442
fantasy -2.815175 -0.888522
comedy -2.520171 2.244404
action -2.063809 -0.191137
romance -2.578774 0.370727
sci-fi 13.038214 -0.061030
There is something wrong. This is even more visible when I visualize the results:
t-SNE scatter plot (image omitted)
PCA scatter plot (image omitted)
This is clearly not what the model has produced. I am sure I am missing something basic. If you have any suggestions, it would be greatly appreciated.

First, you're always going to lose some qualities of the full-dimensionality model when doing a 2D projection, as required for such visualizations. You just hope – & try to choose appropriate methods/parameters – that the important aspects are preserved. So there isn't necessarily anything 'wrong' when a particular visualization disappoints.
And especially with high-dimensional 'dense embeddings' like with word2vec/doc2vec, there's way more info in the full embedding than can be shown in the 2D projection. You may see some sensible micro-relationships in such a plot – close neighbors in a few places matching expectations – but the overall 'map' won't be nearly as interpretable as, well, a real map of a truly 2D surface.
But also: it looks like you're training a Doc2Vec model with only 6 document-tags. Because of the way Doc2Vec works, if there are only 6 unique tags, it's essentially the case that you're training on only 6 virtual documents, just chopped up into different fragments. It's as if you took all the 'comedy' reviews and concatenated them into one big doc, and the same with all the 'romance' reviews, etc.
For many uses of Doc2Vec, and particularly in the published papers that introduced the underlying 'Paragraph Vector' algorithm, it is more typical to use each document's unique ID as its 'tag', especially since many downstream uses then need a doc-vector per document, rather than per known category. This may better preserve/model information in the original data - whereas collapsing everything to just 6 mega-documents, and 6 summary tag-vectors, imposes simpler implied category shapes.
Note that if using unique IDs as tags, you won't automatically wind up with one summary tag-vector per category that you can read from the model. But, you could synthesize such a vector, perhaps by simply averaging the vectors of all the docs in a certain category to get that category's centroid.
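For instance, a minimal sketch of that averaging, assuming you trained with one unique ID per document and kept your own mapping from category to document IDs (the category_to_doc_ids dict below is hypothetical, not something gensim provides):
import numpy as np

def category_centroid(model, doc_ids):
    # average the per-document vectors to get one summary vector per category
    return np.mean([model.docvecs[doc_id] for doc_id in doc_ids], axis=0)

genre_vectors = {genre: category_centroid(doc2vec_model, ids)
                 for genre, ids in category_to_doc_ids.items()}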
It's still sometimes valuable to use known-labels as document tags, either instead-of unique IDs (as you've done here), or in-addition-to unique IDs (using the option of more-than-one tag per training document).
But you should know that using known labels, and only known labels, as tags can be limiting. (For example, if you instead trained a separate vector per document, you could then plot the docs via your visualization, color the dots by their known labels, and see which categories tend to have large overlaps, and highlight certain data points that seem to challenge the categories, or that have nearest neighbors in a different category.)
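A rough sketch of that kind of per-document plot, assuming a model trained with per-document ID tags and a hypothetical doc_id_to_genre mapping you maintain yourself:
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

doc_ids = list(doc_id_to_genre.keys())                  # hypothetical mapping: doc ID -> genre
vectors = [doc2vec_model.docvecs[d] for d in doc_ids]
points = PCA(n_components=2).fit_transform(vectors)

genres = sorted(set(doc_id_to_genre.values()))
color_index = {g: i for i, g in enumerate(genres)}
plt.scatter(points[:, 0], points[:, 1],
            c=[color_index[doc_id_to_genre[d]] for d in doc_ids],
            cmap='tab10', s=10)
plt.show()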

t-SNE embedding: it is a common mistake to think that distances between points (or clusters) in the embedded space are proportional to the distances in the original space. This is a major drawback of t-SNE; for more information see here. Therefore you shouldn't draw any conclusions from the visualization.
PCA embedding: PCA corresponds to a rotation of the coordinate system into a new orthogonal coordinate system that optimally describes the variance of the data. When keeping all principal components, the (Euclidean) distances are preserved; however, when reducing the dimension (e.g. to 2D), the points are projected onto the axes with the most variance, and the distances may no longer correspond to the original distances. Again, it is difficult to draw conclusions about the distances of the points in the original space from the embedding.
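A tiny self-contained illustration of that point, on random data rather than your vectors: distances survive a full PCA (a pure rotation) but generally change after a 2D projection.
import numpy as np
from sklearn.decomposition import PCA
from scipy.spatial.distance import pdist

rng = np.random.RandomState(0)
X = rng.normal(size=(6, 5))                             # 6 points in 5 dimensions

d_orig = pdist(X)                                       # original pairwise distances
d_full = pdist(PCA(n_components=5).fit_transform(X))    # rotation only, all components kept
d_2d = pdist(PCA(n_components=2).fit_transform(X))      # projection onto 2 axes

print(np.allclose(d_orig, d_full))                      # True: distances preserved
print(np.allclose(d_orig, d_2d))                        # False: the 2D projection distorts them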

Related

Retrieving MST for geographic coordinates using scipy minimum spanning tree

I am trying to create a minimum spanning tree (MST) from geographical coordinates using scipy, but for the life of me I cannot understand how to extract information from it. The scipy documentation is not very clear, and multiple searches have not provided results.
For context, in total I have around 200k data points per set; they look like the scatter shown in the attached image (omitted).
My final objective is to create a line vector that connects these points through the MST, more or less as they appear in the image above. But for that I need an ordered list of point indices (or coordinates) I can work with.
Most of all, I need help understanding how to use the output of minimum_spanning_tree, but it might be that I am making mistakes along the way.
Overall steps
The steps I take are:
1. Create the sparse matrix with coordinate info
2. Provide the matrix to scipy.sparse.csgraph.minimum_spanning_tree
3. Do some magic to extract column values
This is the small sample test data:
import numpy as np
import pandas as pd

test_data = {
    "index": [0, 1, 2, 3, 4],
    "X": [35, 36, 37, 38, 38],
    "Y": [2113, 2113, 2112, 2101, 2102],
}
df = pd.DataFrame(test_data)
Step 1, create the sparse matrix
from scipy.sparse import csr_matrix

xs = df[["X"]].values.squeeze().astype(int)
ys = df[["Y"]].values.squeeze().astype(int)
data = np.array(df.index).squeeze().astype(int)
max_dim = max(np.max(xs), np.max(ys)) + 1
max_dim
dist_matr = csr_matrix((data, (xs, ys)), shape=(max_dim, max_dim))
Q1: I couldn't understand what data is in this context, as the scipy docs do not explain it in detail. Should data be the labels of the points, or the edge weights?
Step 2: calculate the minimum spanning tree
from scipy.sparse.csgraph import minimum_spanning_tree

mst = minimum_spanning_tree(dist_matr)
Step 3: get an ordered list of indices (or coordinates)
As I understand it, the output of minimum_spanning_tree is a sparse graph that should look something like this (source)
Q2: However, my matrix is not 5x5 but max_value x max_value (2113 in this case), and it seems like the content of the matrix is not the edge weights. Am I getting this wrong?
I have tried to extract the connected components, but the labels don't make sense to me
from scipy.sparse.csgraph import connected_components

# Label connected components.
num_graphs, labels = connected_components(mst, directed=False)

# This is a snippet I found somewhere, but I have difficulties following its logic
results = [[] for i in range(max(labels) + 1)]
for idx, label in enumerate(labels):
    results[label].append(idx)
A portion of the results (output omitted):
As you can see, the point coordinates are grouped in an odd way, without any relationship between x and y. I have also tried depth_first_order, but aside from requiring a starting point (which I wouldn't know how to choose) it gives me equally confusing outputs.
Q4: How do I "read" the MST matrix and extract the minimum spanning tree for all points?
I am happy to explore other solutions as long as they provide a similar result and are scalable; however, I have seen concerns about NetworkX for large datasets, and MisTree doesn't install on my setup.
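For reference, a minimal sketch of the usual pattern, where the matrix given to minimum_spanning_tree is an n_points x n_points array of pairwise distances (so its entries are edge weights, not point labels), and the kept edges are read back from the sparse result:
import numpy as np
import pandas as pd
from scipy.sparse.csgraph import minimum_spanning_tree

df = pd.DataFrame({"X": [35, 36, 37, 38, 38],
                   "Y": [2113, 2113, 2112, 2101, 2102]})
coords = df[["X", "Y"]].to_numpy(dtype=float)

# dense pairwise Euclidean distances between the 5 points; for ~200k points you
# would build a sparse k-nearest-neighbour graph instead of a full matrix
# (note: zero entries are treated as "no edge" by the csgraph routines)
dist = np.sqrt(((coords[:, None, :] - coords[None, :, :]) ** 2).sum(axis=-1))

mst = minimum_spanning_tree(dist)    # sparse 5x5 result; nonzero entries are the kept edges
edges = mst.tocoo()                  # edge from point edges.row[k] to point edges.col[k]
for i, j, w in zip(edges.row, edges.col, edges.data):
    print(f"edge {i} - {j}, length {w:.2f}")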

Plot a matrix as a single point in space

I have a dataset of drugs, each represented as a graph described by three non-square matrices:
edge index (A), a 2xe matrix, where e is the number of bonds of the molecule; the first row indicates the node (atom) from which the edge (bond) starts, and the second the node where the edge arrives;
node feature matrix (X), an nx9 matrix, where n is the number of atoms of the molecule and 9 is the number of features used to describe them (e.g. atomic number, charge, hybridization);
edge feature matrix (E), a 4xe matrix, where e is the number of bonds of the molecule and 4 is the number of features used to describe them (e.g. type of bond, geometry).
I would like to plot these data in a Cartesian space to see whether clusters form based on their activity label. I thought that if I can reduce each matrix to a single point in space for each graph, I will have three x, y, z coordinates, and then it will be very easy to plot the points. Does this make sense in your opinion? How could I go about turning a matrix into a single point using Python? Finally, I leave you with an example of the plot I would like to create (image omitted).
Thank you all!
Assuming:
The nodes in a drug's graph represent features that every drug has to different extents, including zero.
The structure of a drug's graph models the extent to which every feature applies to that drug.
There is an algorithm to calculate, from a drug's graph, the 'extent' (a number) to which each feature applies to the drug.
Then:
Construct a table where each row models a drug and each column is for a feature. Each cell then contains the "extent" to which the column's feature applies to the row's drug.
Apply the K-Means algorithm to the table.
The challenge is, of course, the algorithm that calculates, from a drug's graph, the 'extent' (a number) of each feature.
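As a rough sketch of those two steps, assuming the per-feature 'extent' numbers can already be computed somehow (extent_of, drug_graphs and feature_names below are hypothetical placeholders):
import pandas as pd
from sklearn.cluster import KMeans

# rows: one drug per row, one "extent" number per feature (extent_of is the missing piece)
rows = {drug: [extent_of(graph, feature) for feature in feature_names]
        for drug, graph in drug_graphs.items()}
table = pd.DataFrame.from_dict(rows, orient="index", columns=feature_names)

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(table)
print(dict(zip(table.index, labels)))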
IMHO the first step is to enter your data into a graph theory library. I see you are using Python. Python folks generally use a library called networkx. Are you familiar with this library?
Personally, I much prefer to work with C++ (it gives the performance required for large problem sets). Recently, I added a SMILES parser to my C++ graph library.
Convert the SMILES representation of each drug to its graph representation
Calculate the graph edit distance (GED, https://en.wikipedia.org/wiki/Graph_edit_distance) between every pair of drugs
LOOP GEDMAX from 1 to 10
Add a connection between two drugs if the GED is less than GEDMAX. This forms a new graph we can call "GEDgraph"
Find the components ( clusters of drugs all reachable from each other in the GEDgraph )
SELECT "best" set of components

PCA Explained Variance Analysis

I'm very new to PCA.
I have 10 X variables for my model. These are the X variable labels:
x = ['Day','Month', 'Year', 'Rolling Average','Holiday Effect', 'Day of the Week', 'Week of the Year', 'Weekend Effect', 'Last Day of the Month', "Quarter" ]
This is the graph I generated from the explained variance (plot omitted), with the x-axis being the principal component. The explained variance values are:
[ 3.47567089e-01 1.72406623e-01 1.68663799e-01 8.86739892e-02
4.06427375e-02 2.75054035e-02 2.26578769e-02 5.72892368e-03
2.49272688e-03 6.37160140e-05]
I need to know whether I have a good selection of features, and how I can tell which feature contributes the most.
from sklearn import decomposition
pca = decomposition.PCA()
pca.fit(X_norm)
scores = pca.explained_variance_
Though I do NOT know the dataset, I recommend that you scale your features before using PCA (variance will be maximized along the axes). I think X_norm refers to that in your code.
By using PCA, we aim to reduce dimensionality. In order to do that, we start with a feature space that includes all X variables (in your case), and end up with a projection of that space, which is typically a different feature (sub)space.
In practice, when you have correlations between features, PCA can help you to project that correlation to smaller dimensions.
Think about this, if I'm holding a paper on my desk with full of dots on it, do I need the 3rd dimension to represent that dataset? Probably not, since all the dots are on paper and could be represented in 2D space.
When you are trying to decide how many principal components you will use from your new feature space, you can look at explained variance and it will tell you how much information is there for each principal component.
When I look at the principal components in your data, I see that ~85% of the variance could be attributed to first 6 principal components.
You can also set n_components. For example, if you use n_components=2, then your transformed dataset will have 2 features.
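Continuing the snippet above, a small sketch of picking n_components from the cumulative explained variance ratio (X_norm is assumed to be the scaled feature matrix):
import numpy as np
from sklearn import decomposition

pca = decomposition.PCA().fit(X_norm)
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_keep = int(np.searchsorted(cumulative, 0.85) + 1)    # smallest k reaching ~85% of the variance

X_reduced = decomposition.PCA(n_components=n_keep).fit_transform(X_norm)
print(n_keep, X_reduced.shape)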

t-SNE High Dimension Data Visualisation

I have a Twitter corpus which I am using to build a sentiment analysis application. The corpus has 5k tweets which have been hand-labelled as negative, neutral, or positive.
To represent the text, I am using gensim word2vec pretrained vectors. Each word is mapped to 300 dimensions. For a tweet, I add all of its word vectors to get a single 300-dim vector. Thus every tweet is mapped to a single vector of 300 dimensions.
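A small sketch of that summation step, assuming kv is a loaded 300-dimensional gensim KeyedVectors model and tokenized_tweets is a list of token lists (both names are placeholders):
import numpy as np

def tweet_vector(tokens, kv, dim=300):
    vecs = [kv[w] for w in tokens if w in kv]    # skip out-of-vocabulary words
    return np.sum(vecs, axis=0) if vecs else np.zeros(dim)

X = np.vstack([tweet_vector(tokens, kv) for tokens in tokenized_tweets])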
I am visualizing my data using t-SNE (the tsne Python package). See the attached image (omitted): red points = negative tweets, blue points = neutral tweets, green points = positive tweets.
Question:
In the plot there is no clear separation (boundary) among the data points. Can I assume this will also be the case with the original points in 300 dimensions?
I.e., if points overlap in the t-SNE graph, do they also overlap in the original space, and vice versa?
Question: In the plot there is no clear separation (boundary) among the data points. Can I assume this will also be the case with the original points in 300 dimensions?
In most cases, NO. By reducing dimensions you will probably lose some information.
The cases where you can reduce the dimension without losing information are when the data is zero in some dimensions (for example, a line in 3-dimensional space) or when some dimensions are linearly dependent on others.
There are a few tricks to test how well a dimensionality reduction technique works. For example:
You can use PCA to reduce the dimension from 300 to, say, 10. Compute the sum of all 300 eigenvalues (the original space) and the sum of the 10 largest eigenvalues (those corresponding to the eigenvectors used for the reduction); the ratio sum(top-10 eigenvalues) / sum(all 300 eigenvalues) is roughly the fraction of 'information' retained, and one minus it is roughly what is lost.
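A small sketch of that eigenvalue check, where X is the n_tweets x 300 matrix of tweet vectors:
import numpy as np

eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))    # 300 eigenvalues, ascending
retained = eigvals[-10:].sum() / eigvals.sum()           # fraction kept by the top 10 components
print(f"retained: {retained:.2%}, lost: {1 - retained:.2%}")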

Use K-means to learn features in Python

Question
I implemented a K-Means algorithm in Python. First I apply PCA and whitening to the input data. Then I use k-means to successfully extract k centroids from the data.
How can I use those centroids to understand the "features" learnt? Are the centroids already the features (it doesn't seem like that to me), or do I need to combine them with the input data again?
Regarding some of the answers: K-means is not "just" a method for clustering; it is a vector quantization method. That said, the goal of k-means is to describe a dataset with a reduced number of feature vectors, so there are strong analogies to methods like Sparse Filtering/Learning regarding the potential outcome.
Code Example
from scipy.cluster.vq import vq  # vq assigns each vector to its nearest centroid

# Perform K-means, data already pre-processed (k_means: custom helper returning the k=1000 centroids)
centroids = k_means(matrix_pca_whitened, 1000)

# Assign data to centroids
idx, _ = vq(song_matrix_pca, centroids)
The clusters produced by the K-means algorithm separate your input space into K regions. When you have new data, you can tell which region it belongs to, and thus classify it.
The centroids are just a property of these clusters.
You can have a look at the scikit-learn docs if you are unsure, and at its algorithm cheat-sheet map to make sure you choose the right algorithm.
This is sort of a circular question: "understanding" requires knowing something about the features outside of the k-means process. All that k-means does is identify k groups of physical proximity. It says "there are clumps of stuff in these 'k' places, and here's how all the points map to the nearest one."
What this means in terms of the features is up to the data scientist, rather than any deeper meaning that k-means can ascribe. The variance of each group may tell you a little about how tightly those points are clustered. Do remember that k-means also chooses starting points at random; an unfortunate choice can easily give a sub-optimal description of the space.
A centroid is basically the "mean" of the cluster. If you can ascribe some deeper understanding from the distribution of centroids, great -- but that depends on the data and features, rather than any significant meaning devolving from k-means.
Is that the level of answer you need?
The centroids are in fact the features learnt. Since k-means is a vector quantization method, we look up which cluster each observation belongs to, and therefore which feature vector (centroid) describes it best.
If an observation was, for example, split into 10 patches beforehand, that observation can consist of at most 10 feature vectors.
Example:
Method: K-means with k=10
Dataset: 20 observations divided into 2 patches each = 40 data vectors
We now perform K-means on this patched dataset and get the nearest centroid per patch. We could then create a vector for each of the 20 observations with the length 10 (=k) and if patch 1 belongs to centroid 5 and patch 2 belongs to centroid 9 the vector could look like: 0 - 0 - 0 - 0 - 1 - 0 - 0 - 0 - 1 - 0.
This means that this observation consists of the centroids/features 5 and 9. You could also use the distance between each patch and each centroid instead of this hard assignment.
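A toy sketch of that hard-assignment encoding with scipy's kmeans/vq (random data, 20 observations with 2 patches each, k=10):
import numpy as np
from scipy.cluster.vq import kmeans, vq

rng = np.random.RandomState(0)
patches = rng.normal(size=(40, 5))             # 40 patch vectors (2 per observation)
centroids, _ = kmeans(patches, 10)             # learn k=10 centroids
assignment, _ = vq(patches, centroids)         # nearest centroid per patch

encoded = np.zeros((20, len(centroids)))       # one k-length indicator vector per observation
for obs in range(20):
    for patch in (2 * obs, 2 * obs + 1):       # the two patches belonging to this observation
        encoded[obs, assignment[patch]] = 1
print(encoded[0])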
