t-SNE High Dimension Data Visualisation - python

I have a Twitter corpus which I am using to build a sentiment analysis application. The corpus has 5k tweets, each hand-labelled as negative, neutral or positive.
To represent the text, I am using gensim word2vec pretrained vectors. Each word is mapped to 300 dimensions. For a tweet, I add up all of its word vectors, so every tweet is mapped to a single 300-dimensional vector.
I am visualizing my data using t-SNE (tsne Python package). See attached image 1: red points = negative tweets, blue points = neutral tweets and green points = positive tweets.
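For reference, a minimal sketch of this representation and projection step with gensim and scikit-learn; tweets stands for your list of 5k raw tweet strings, and the vector file path and tokenization are placeholders:

import numpy as np
from gensim.models import KeyedVectors
from sklearn.manifold import TSNE

# pretrained 300-dim word2vec vectors (path is a placeholder)
wv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

def tweet_vector(tweet):
    # sum the vectors of all in-vocabulary tokens; zero vector if none are found
    words = [w for w in tweet.lower().split() if w in wv]
    return np.sum([wv[w] for w in words], axis=0) if words else np.zeros(300)

X = np.vstack([tweet_vector(t) for t in tweets])                # (5000, 300)
X_2d = TSNE(n_components=2, perplexity=30).fit_transform(X)     # (5000, 2) for plotting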
Question:
In the plot there is no clear separation (boundary) among the data points. Can I assume this will also be the case for the original points in 300 dimensions?
I.e. if points overlap in the t-SNE plot, do they also overlap in the original space, and vice versa?

Question: In the plot there is no clear separation (boundary) among the data points. Can I assume this will also be the case for the original points in 300 dimensions?
In most cases, no. By reducing dimensions you will probably lose some information.
The cases where you can reduce dimensionality without losing information are when your data is zero in some dimensions (for example, a line in 3-dimensional space) or when some dimensions are linear combinations of the others.
There are a few tricks to test how well a dimensionality reduction technique works. For example:
You may use PCA to reduce the dimension from 300 to, say, 10. Compute the sum of all 300 eigenvalues (original space) and the sum of the 10 largest eigenvalues (whose eigenvectors are used for the reduction). The ratio sum(top-10 eigenvalues) / sum(all 300 eigenvalues) is the fraction of variance retained, and one minus that ratio is roughly the fraction of information lost. This is not exactly "information" lost, but it is close to that.
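As a rough sketch of that check with scikit-learn, where the explained-variance ratios play the role of the normalized eigenvalues and X is your (5000, 300) matrix of tweet vectors:

from sklearn.decomposition import PCA

pca = PCA(n_components=10).fit(X)                   # keep the 10 largest components
retained = pca.explained_variance_ratio_.sum()      # fraction of total variance kept
print(f"variance retained: {retained:.2%}, roughly lost: {1 - retained:.2%}")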

Related

Visualize documents embeddings and clustering

I have the following dataframe:
print(df)
document    embeddings
1           [-1.1132643 , 0.793635 , 0.8664889]
2           [-1.1132643 , 0.793635 , 0.8664889]
3           [-0.19276126, -0.48233205, 0.17549737]
4           [0.2080252 , 0.01567003, 0.0717131]
I want to cluster and visualize them to see the similarities between the documents. What is the best method/steps to do this?
This is just a small dataframe, the original dataframe has more than 20k documents.
Document vectors in your case reside in a 768-dimensional Euclidean space, meaning each point in that 768-dimensional coordinate space represents a document. Assuming these have been trained correctly, it is safe to expect that contextually similar documents will be closer to each other in this space than dissimilar ones. This allows you to apply a clustering method to group similar documents together.
For clustering, you can use any of several clustering techniques, such as:
K-means (clusters based on Euclidean distances)
DBSCAN (clusters based on a notion of density)
Gaussian mixtures (clusters based on a mixture of k Gaussians)
You can use the silhouette score to find the number of clusters for which the clustering algorithm best separates the data, as in the sketch below.
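As a rough sketch of that with scikit-learn, assuming your embeddings are stacked into a (20000, 768) NumPy array X; the candidate range of k is arbitrary:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

best_k, best_score = None, -1.0
for k in range(2, 11):                                   # candidate numbers of clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # silhouette on a subsample keeps this tractable for 20k points
    score = silhouette_score(X, labels, sample_size=5000, random_state=0)
    if score > best_score:
        best_k, best_score = k, score
print(best_k, best_score)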
For visualization, you can only plot in 2D or 3D space. This means you will have to use a dimensionality reduction method to bring the 768 dimensions down to 2 or 3.
This can be achieved with the following algorithms set to 2 or 3 components -
PCA
T-SNE
LDA (requires labels)
Once you have clustered the data AND reduced the dimensionality of the data separately, you can use matplotlib to plot each of the points in a 2D/3D space and color each point based on its cluster (0-7) to visualize documents and clusters.
#process flow
(20k,768) -> K-clusters    (20k,1) ---|
                                      |--- Visualize (3 axes, k colors)
(20k,768) -> Dim reduction (20k,3) ---|
Here is an example of the goal you are trying to achieve -
Here, you see the first two T-SNE components, and each color represents one of the clusters created by your clustering method of choice (with the number of clusters chosen via the silhouette score).
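A minimal sketch of this flow with scikit-learn and matplotlib, assuming X is the (20000, 768) embedding matrix; k=8 matches the 0-7 coloring above and is otherwise arbitrary:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

# cluster on the full 768-dimensional vectors
labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(X)
# reduce to 2-D separately, only for plotting
X_2d = TSNE(n_components=2, random_state=0).fit_transform(X)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels, cmap="tab10", s=5)
plt.xlabel("t-SNE component 1")
plt.ylabel("t-SNE component 2")
plt.colorbar(label="cluster")
plt.show()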
EDIT: You can also apply dimensionality reduction first, projecting your 768-dimensional data into a 3D or 2D space, and THEN cluster with a clustering method. This reduces the amount of computation you have to handle, since you are now clustering on only 3 dimensions instead of 768, but at the cost of information that might help you discriminate the clusters better.
#process flow
                                      |--------------------------|
(20k,768) -> Dim reduction (20k,3) ---|                          |--- Visualize
                                      |--- K-clusters (20k,1) ---|
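The same idea with the order flipped (reduce to a low-dimensional space first, then cluster on the projection); this is cheaper but potentially lossier:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

# X: (20000, 768) document embeddings
X_3d = TSNE(n_components=3, random_state=0).fit_transform(X)                 # reduce 768 -> 3 first
labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(X_3d)   # then cluster in 3-D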

MNIST dataset conversion

I've trained a decision tree on a handwritten-digit dataset in which each digit is represented by 8 (x, y) points sampled along the length of its stroke. The test dataset given in the assignment is MNIST, which consists of pixel intensities in 28x28 bitmap images. I need to sample 8 points along the trajectory of each digit so that the tree performs well.
I'm doing this in Python. I don't know what to do with the image to sample those points. Any package/procedure would help.
Simply index the array as you would any other array; the pixel intensities are just ints, e.g. val = arr[3, 9].
You do not have the stroke direction in MNIST.
Hence, you cannot reliably infer such positions.
You can do the opposite though: render the stroke information as a pixel image, train a classifier on that, and then test it on MNIST.
There's an MNIST sequence dataset from Edwin de Jong:
Paper: https://arxiv.org/pdf/1611.03068.pdf
Blog: https://edwin-de-jong.github.io/blog/mnist-sequence-data/
GitHub: https://github.com/edwin-de-jong/mnist-digits-as-stroke-sequences/
and an MNIST classification using RNN by Ryan Epp:
https://www.ryanepp.com/blog/mnist-classification-using-stroke-paths
In both projects, the direction strokes will take at a T-junction depends on the algorithm and is often counterintuitive. This means that there's more to learn for sequences since many stroke patterns will produce the same image.
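If you take the rendering route suggested above, here is a rough NumPy sketch of rasterizing an 8-point stroke onto a 28x28 MNIST-style bitmap (the points are assumed to be already scaled to the [0, 27] pixel range):

import numpy as np

def render_stroke(points, size=28):
    """points: sequence of 8 (x, y) samples along the digit, scaled to [0, size-1]."""
    img = np.zeros((size, size), dtype=np.uint8)
    for (x0, y0), (x1, y1) in zip(points[:-1], points[1:]):
        # interpolate enough steps that consecutive samples leave no gaps
        n = int(max(abs(x1 - x0), abs(y1 - y0))) + 1
        for t in np.linspace(0.0, 1.0, n + 1):
            x = int(round(x0 + t * (x1 - x0)))
            y = int(round(y0 + t * (y1 - y0)))
            img[y, x] = 255          # row = y, column = x, as in MNIST arrays
    return img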

How should I represent my network?

I am having a problem with how to represent my network (and data) for training my CNN. The input data consists of differently sized images, in which the number of rows is constant (441) and the color dimension is constant (RGB), but the number of columns differs. The CNN is supposed to generate a feature vector whose length depends on the number of columns in each image.
Example:
An image shaped (441,300,3) should generate a feature vector of length 98
An image shaped (441,1209,3) should generate a feature vector of length 398
So nearly each column should generate 3 features.
Is it possible to do a convolution with a kernel that covers a whole column and outputs 3 features? The reason I want a kernel that covers the whole column is that I want to give some areas of the column more importance, instead of weighting the whole area equally.
If so, how would I go about designing such a kernel, or such a network, in Keras?
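One possible sketch in Keras (TensorFlow backend): a Conv2D whose kernel spans all 441 rows and slides only horizontally over a variable-width input. The kernel width (9), stride (3) and filter count (3) below are illustrative guesses, not values that reproduce your exact 98/398 lengths:

import numpy as np
import tensorflow as tf

inp = tf.keras.Input(shape=(441, None, 3))          # fixed rows, variable columns, RGB
# one kernel covering the full column height, sliding horizontally in steps of 3
feat = tf.keras.layers.Conv2D(filters=3, kernel_size=(441, 9),
                              strides=(1, 3), activation="relu")(inp)
model = tf.keras.Model(inp, feat)

print(model(np.zeros((1, 441, 300, 3), dtype="float32")).shape)   # (1, 1, 98, 3)
print(model(np.zeros((1, 441, 1209, 3), dtype="float32")).shape)  # (1, 1, 401, 3)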

Use K-means to learn features in Python

Question
I implemented a K-Means algorithm in Python. First I apply PCA and whitening to the input data. Then I use k-means to extract k centroids from the data.
How can I use those centroids to understand the "features" learnt? Are the centroids already the features (doesn't seem like this to me) or do I need to combine them with the input data again?
Because of some answers: K-means is not "just" a clustering method; it is a vector quantization method. That is, the goal of k-means here is to describe a dataset with a reduced number of feature vectors. In that respect there are strong analogies to methods like sparse filtering/learning regarding the potential outcome.
Code Example
# scipy.cluster.vq assumed here for the kmeans/vq helpers
from scipy.cluster.vq import kmeans, vq

# Perform K-means on the pre-processed (PCA + whitened) data, k = 1000
centroids, _ = kmeans(matrix_pca_whitened, 1000)
# Assign each data vector to its nearest centroid
idx, _ = vq(song_matrix_pca, centroids)
The clusters produced by the k-means algorithm separate your input space into K regions. When you have new data, you can tell which region it belongs to, and thus classify it.
The centroids are just a property of these clusters.
You can have a look at the scikit-learn documentation if you are unsure, and at its algorithm-selection map to make sure you choose the right algorithm.
This is sort of a circular question: "understanding" requires knowing something about the features outside of the k-means process. All that k-means does is identify k groups in terms of physical proximity. It says "there are clumps of stuff in these k places, and here's how all the points map to the nearest one."
What this means in terms of the features is up to the data scientist, rather than any deeper meaning that k-means can ascribe. The variance of each group may tell you a little about how tightly those points are clustered. Do remember that k-means also chooses starting points at random; an unfortunate choice can easily give a sub-optimal description of the space.
A centroid is basically the "mean" of the cluster. If you can ascribe some deeper understanding from the distribution of centroids, great -- but that depends on the data and features, rather than any significant meaning devolving from k-means.
Is that the level of answer you need?
The centroids are in fact the features learnt. Since k-means is a method of vector quantization, we look up which cluster each observation (or patch) belongs to, and that observation is then best described by the corresponding feature vector (centroid).
If an observation has, for example, been split into 10 patches beforehand, the observation can be described by at most 10 such feature vectors.
Example:
Method: K-means with k=10
Dataset: 20 observations divided into 2 patches each = 40 data vectors
We now perform K-means on this patched dataset and get the nearest centroid per patch. We could then create a vector for each of the 20 observations with the length 10 (=k) and if patch 1 belongs to centroid 5 and patch 2 belongs to centroid 9 the vector could look like: 0 - 0 - 0 - 0 - 1 - 0 - 0 - 0 - 1 - 0.
This means that this observation consists of the centroids/features 5 and 9. You could also use the distance between patch and centroid instead of this hard assignment.
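A small NumPy/SciPy sketch of that encoding, assuming patches is the (40, d) matrix of patch vectors with the two patches of each observation stored in consecutive rows:

import numpy as np
from scipy.cluster.vq import kmeans, vq

k = 10
centroids, _ = kmeans(patches, k)        # patches: (40, d) = 20 observations x 2 patches
codes, _ = vq(patches, centroids)        # index of the nearest centroid for every patch

features = np.zeros((20, k))
for obs in range(20):
    for code in codes[obs * 2:(obs + 1) * 2]:   # the 2 patches of this observation
        features[obs, code] = 1                 # hard assignment; distances would also work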

Subtract mean from image

I'm implementing a CNN with Theano. According to the paper, I have to do this image preprocessing before training the CNN:
We extracted RGB patches of 61x61 dimensions associated with each poselet activation, subtracted the mean and used this data to train the convnet model shown in Table 1
Can you tell me what "subtracted the mean" means? And tell me whether these steps are correct (this is what I understood):
1) Compute the mean for Red Channel, Green Channel and Blue Channel for the whole image
2) For each pixel, subtract from red value the mean of red channel, from green value the mean of green channel and the same for the blue channel
3) Is it correct to have negative values, or do I have to take the absolute value?
Thanks all!!
You should read the paper carefully, but most probably they mean the mean of the patches. You have N matrices of 61x61 pixels, each equivalent to a vector of length 61^2 (or 3*61^2 if there are three channels). They simply compute the mean of each dimension: the mean over these N vectors with respect to each of the 3*61^2 dimensions. The result is a mean vector of length 3*61^2 (or a mean matrix/mean patch if you prefer), which they subtract from every one of the N patches. The resulting patches will have negative values; that is perfectly fine, you should not take the absolute value, neural networks prefer this kind of data.
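As a small NumPy sketch of that mean-patch subtraction, assuming the N patches are stacked into an array patches of shape (N, 61, 61, 3):

import numpy as np

# patches: (N, 61, 61, 3) array of RGB patches
mean_patch = np.mean(patches, axis=0)      # one 61x61x3 "mean patch" over all N patches
patches_centered = patches - mean_patch    # negative values are expected and fine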
I would assume the mean mentioned in the paper is the mean over all images used in the training set (computed separately for each channel).
Several indications:
Caffe is a lib for ConvNets. In their tutorial they mention the compute image mean part: http://caffe.berkeleyvision.org/gathered/examples/imagenet.html
For this they use the following script: https://github.com/BVLC/caffe/blob/master/examples/imagenet/make_imagenet_mean.sh
which does what I indicated.
Google played around with ConvNets and published their code here: https://github.com/google/deepdream/blob/master/dream.ipynb and they do also use the mean of the training set.
This is of course only indirect evidence, since I cannot explain to you why this is done. In fact, I stumbled over this question while trying to figure out precisely that.
//EDIT:
In the meantime I found a source confirming my claim (highlighting added by me):
There are three common forms of data preprocessing a data matrix X [...]
Mean subtraction is the most common form of preprocessing. It involves subtracting the mean across every individual feature in the data, and has the geometric interpretation of centering the cloud of data around the origin along every dimension. In numpy, this operation would be implemented as: X -= np.mean(X, axis = 0). With images specifically, for convenience it can be common to subtract a single value from all pixels (e.g. X -= np.mean(X)), or to do so separately across the three color channels.
As we can see, the whole data is used to compute the mean.
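For the per-channel variant mentioned in the quote (one mean per color channel over the whole training set), a minimal sketch, assuming X_train is an (N, H, W, 3) array of training images:

import numpy as np

channel_mean = np.mean(X_train, axis=(0, 1, 2))   # one scalar per RGB channel
X_centered = X_train - channel_mean               # broadcasts over N, H and W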
