Machine Learning clustering with n-dimensional data in Python

I'm trying to figure out a procedure to perform clustering on a set of data with 52 dimensions. This is purely for my own learning so I have a data set of known fields. The data is from retrosheet.org Gamelogs using the World Series data set. I'm attempting to use only columns 25-77, so only the integers, ignoring the string data.
This is my first attempt at unsupervised learning, and while I understand the concepts, I'm struggling to implement a solution in Python. I've been using scipy and numpy. If anyone knows a good place to start or has suggestions for tackling this problem, I'd appreciate it.

scikit-learn is the way to go for clustering in Python. See http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_digits.html#example-cluster-plot-kmeans-digits-py for a demo and code for clustering with 64 features. It would be good to start with the tutorial at http://scikit-learn.org/stable/tutorial/basic/tutorial.html, apply what you learn there to your dataset, and then move on to k-means clustering.
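As a rough starting point, here is a minimal sketch of k-means on a 52-column numeric matrix. The random integers only stand in for the game-log columns, and the choice of 4 clusters and the scaling step are assumptions for illustration, not something the linked demo prescribes.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Placeholder data: 200 rows x 52 integer columns. With the real game logs,
# load the file and keep only the numeric columns instead.
rng = np.random.default_rng(0)
X = rng.integers(0, 20, size=(200, 52)).astype(float)

# Put every column on a comparable scale so no single statistic dominates
# the Euclidean distances that k-means relies on.
X_scaled = StandardScaler().fit_transform(X)

# n_clusters=4 is an arbitrary starting value, not a recommendation.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_scaled)

print(labels[:10])                    # cluster id assigned to the first rows
print(kmeans.cluster_centers_.shape)  # one 52-dimensional centre per cluster
```

Each row gets one cluster id, and the centres live in the same 52-dimensional space as the data, so they can be inspected column by column.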

Related

Clustering text data based on sentiment?

I am scraping reviews off Amazon with the intent to perform sentiment analysis to classify them into positive, negative and neutral. Now the data I would get would be text and unlabeled.
My approach to this problem would be as follows:
1.) Label the data using clustering algorithms like DBSCAN, HDBSCAN or KMeans. The number of clusters would obviously be 3.
2.) Train a Classification algorithm on the labelled data.
Now I have never performed clustering on text data but I am familiar with the basics of clustering. So my question is:
Is my approach correct?
Are there any articles, blogs or tutorials I can follow for text-based clustering, since I am fairly new to this?
I have never run such an experiment, but as far as I know, the most challenging part of this work is transforming the sentences or documents into fixed-length vectors (mapping them into a semantic space). I strongly suggest using a sentiment-analysis pipeline from the Hugging Face library to embed the sentences (this way you might exploit some supervision). There are other options as well:
Using the sentence-transformers library (straightforward and still good).
Using BoW (the simplest way, but it is hard to get what you want).
Using TF-IDF (still simple, but it may well do the job).
Once every review is mapped to a fixed-length vector, you can use whatever clustering algorithm you like and then inspect the results.
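To make the last step concrete, here is a small sketch using the TF-IDF option followed by k-means with 3 clusters. The reviews are made up for illustration, and nothing guarantees that the three clusters line up with positive, negative and neutral sentiment; they may just as well group by topic or vocabulary.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up reviews standing in for the scraped data.
reviews = [
    "Great product, works perfectly and arrived early",
    "Terrible quality, broke after two days",
    "It is okay, nothing special",
    "Absolutely love it, would buy again",
    "Waste of money, very disappointed",
    "Does the job, average build quality",
]

# Every review becomes one fixed-length (sparse) TF-IDF vector.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(reviews)

# Three clusters, as in the question. The algorithm only sees word weights,
# so the clusters are not forced to correspond to sentiment.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

for label, review in zip(labels, reviews):
    print(label, review)
```

Reading a few reviews from each cluster is usually the quickest way to check whether the grouping means anything before training a classifier on it.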

Is there a way to perform probabilistic PCA using Python and scikit-learn?

Is there a way to perform probabilistic PCA using Python and scikit-learn? I am trying to perform PPCA but I can't find a library that does it.
https://scikit-learn.org/stable/auto_examples/decomposition/plot_pca_vs_fa_model_selection.html
There's an example that touches on it, and I think it will help you. It looks like you have to do your own scoring to get the exact probabilistic PCA implementation you're after for your data. Playing around with the results of a similar implementation will probably help you figure out your issues.
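A minimal sketch of that kind of scoring, assuming the goal is to pick the number of components by likelihood. In scikit-learn, `PCA.score` returns the average log-likelihood of the samples under the probabilistic PCA model, so it can be passed through `cross_val_score`. The synthetic data and the range of component counts are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score

# Synthetic data: 3 latent factors embedded in 10 dimensions plus
# isotropic Gaussian noise (the setting probabilistic PCA assumes).
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 3))
mixing = rng.normal(size=(3, 10))
X = latent @ mixing + 0.1 * rng.normal(size=(300, 10))

# With no explicit scoring argument, cross_val_score falls back on
# PCA.score, i.e. the held-out average log-likelihood.
scores = {}
for k in range(1, 8):
    pca = PCA(n_components=k, svd_solver="full")
    scores[k] = cross_val_score(pca, X).mean()

best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```

The fitted model also exposes `noise_variance_` and `score_samples`, which may cover part of what a dedicated PPCA library would offer.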

Reverse-engineering a clustering algorithm from the clusters

I have a clustering of data performed by a human based solely on their knowledge of the system. I also have a feature vector for each element. I have no knowledge about the meaning of the features, nor do I know what the reasoning behind the human clustering was.
I have complete information about which elements belong to which cluster. I can assume that the human was not stupid and there is a way to derive the clustering from the features.
Is there an intelligent way to reverse-engineer the clustering? That is, how can I select the features and the clustering algorithm that will yield the same clustering most of the time (on this data set)?
So far I have tried the naive approach - going through the clustering algorithms provided by the sklearn library in python and comparing the obtained clusters to the source one. This approach does not yield good results.
My next approach would be to use some linear combinations of the features, or subsets of features. Here, again, my question is if there is a more intelligent way to do this than to go through as many combinations as possible.
I can't shake the feeling that this is a standard problem and I'm just missing the right term to find the solution on Google.
Are you sure it was done automatically?
It sounds to me as if you should be treating this as a classification problem: construct a classifier that does the same as the human did.
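A sketch of the classification framing, under the assumption that the human cluster ids are used as class labels. The synthetic data and the random forest are illustrative choices only; any classifier could take its place.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Stand-in data: X plays the role of the feature vectors and y the
# cluster id the human assigned to each element.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_redundant=0,
    n_classes=3, random_state=0,
)

clf = RandomForestClassifier(n_estimators=200, random_state=0)

# Cross-validated accuracy: how well the features alone reproduce the grouping.
accuracy = cross_val_score(clf, X, y, cv=5).mean()
print(round(accuracy, 3))

# Feature importances hint at which features the grouping depends on,
# which addresses the feature-selection part of the question.
clf.fit(X, y)
top_features = clf.feature_importances_.argsort()[::-1][:5]
print(top_features)
```

If the cross-validated accuracy is high, the most important features are reasonable candidates for the subset to feed into a clustering algorithm afterwards.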

How to use the sklearn.cluster.MeanShift algorithm?

I have a problem where I need to predict a list of objects based on previous history of the usage of the objects. It is a recommendation system in short.
I figured I can use clustering on existing data, and then try to find pattern among the clusters.
For this I came across the scikit-learn library in Python, and I think it will work.
But I need to know how to use one of its clustering algorithms (say MeanShift), since the examples they provide mostly work on datasets that ship with the library itself.
So,
How do I organize my data so that I can use the MeanShift class from sklearn.cluster package?
My data points are multidimensional, so will I be able to use the sklearn package in the first place? They haven't mentioned any constraints.
If I can cluster multidimensional data points, will I have to do dimensionality reduction? (I don't know how to do this either, but I am aware of the concept.)
I have done some data mining in one of my courses, but these are new waters for me. Any help in terms of pointing to resources or tutorials will be highly appreciated.
Thank you.

Document Clustering in python using SciKit

I recently started working on Document clustering using SciKit module in python. However I am having a hard time understanding the basics of document clustering.
What I know:
Document clustering is typically done using TF-IDF, which essentially converts the words in the documents to a vector space model that is then input to the algorithm.
There are many algorithms, such as k-means, neural networks and hierarchical clustering, to accomplish this.
My data:
I am experimenting with LinkedIn data. Each document would be a LinkedIn profile summary, and I would like to see if similar job documents get clustered together.
Current challenges:
My data has huge summary descriptions, which end up becoming tens of thousands of words when I apply TF-IDF. Is there any proper way to handle this high-dimensional data?
K-means and other algorithms require that I specify the number of clusters (centroids), but in my case I do not know the number of clusters up front. This, I believe, is completely unsupervised learning. Are there algorithms that can determine the number of clusters themselves?
I've never worked with document clustering before. If you are aware of tutorials, textbooks or articles that address this issue, please feel free to suggest them.
I went through the code on the scikit-learn webpage, but it contains too many technical terms that I do not understand. If you have any code with good explanations or comments, please share. Thanks in advance.
My data has huge summary descriptions, which end up becoming tens of thousands of words when I apply TF-IDF. Is there any proper way to handle this high-dimensional data?
My first suggestion is that you don't reduce it unless you absolutely have to because of memory or execution-time problems.
If you must handle it, you should use dimensionality reduction (PCA, for example) or feature selection (probably better in your case; see chi2, for example).
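A small sketch of two ways to shrink the TF-IDF matrix, with some substitutions made for illustration: the vocabulary is capped with `max_features` as a simple unsupervised form of feature selection (chi2 needs class labels, which this data does not have), and `TruncatedSVD` is used in place of PCA because it accepts sparse input directly. The profile summaries are made up.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up profile summaries standing in for the real documents.
summaries = [
    "Software engineer with experience in Python and distributed systems",
    "Backend developer building Python services and REST APIs",
    "Registered nurse working in intensive care and patient support",
    "Clinical nurse focused on patient care in a hospital setting",
    "Data scientist using Python for machine learning models",
    "Hospital pharmacist supporting patient medication plans",
]

# Cap the vocabulary up front: drop stop words and keep at most the
# 5000 most frequent terms.
vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
X = vectorizer.fit_transform(summaries)

# Project the sparse matrix down to a handful of dense components.
svd = TruncatedSVD(n_components=3, random_state=0)
X_reduced = svd.fit_transform(X)

print(X.shape, "->", X_reduced.shape)
```

On a real corpus the number of components would be much larger than 3; the value here only keeps the toy example valid.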
K-means and other algorithms require that I specify the number of clusters (centroids), but in my case I do not know the number of clusters up front. This, I believe, is completely unsupervised learning. Are there algorithms that can determine the number of clusters themselves?
If you look at the clustering algorithms available in scikit-learn, you'll see that not all of them require that you specify the number of clusters.
Another one that does not is hierarchical clustering, implemented in scipy. Also see this answer.
I would also suggest that you use KMeans and try to manually tweak the number of clusters until you are satisfied with the results.
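Both suggestions can be sketched briefly. The first part cuts a scipy hierarchical clustering tree at a distance threshold instead of fixing the number of clusters; the second sweeps k for KMeans and uses the silhouette score as one possible stand-in for manual tweaking. The blob data, the threshold of 10.0 and the range of k are arbitrary assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Toy data; replace with the (possibly reduced) document vectors.
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.6, random_state=0)

# Hierarchical clustering: build the tree once, then cut it at a distance
# threshold. The number of clusters falls out of the cut, but the
# threshold itself still has to be chosen.
Z = linkage(X, method="ward")
hier_labels = fcluster(Z, t=10.0, criterion="distance")
print("hierarchical clusters:", len(np.unique(hier_labels)))

# KMeans: try several values of k and keep the one with the best
# silhouette score (higher means tighter, better separated clusters).
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print("best k by silhouette:", best_k)
```

Neither approach removes the need for judgment: one trades the number of clusters for a distance threshold, the other for a quality metric.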
I've never worked with document clustering before. If you are aware of tutorials, textbooks or articles that address this issue, please feel free to suggest them.
scikit-learn has a lot of tutorials for working with text data; just use the "text data" search query on their site. One is for KMeans, others are for supervised learning, but I suggest you go over those too to get more familiar with the library. From a coding, style and syntax POV, unsupervised and supervised learning are pretty similar in scikit-learn, in my opinion.
Document clustering is typically done using TF-IDF, which essentially converts the words in the documents to a vector space model that is then input to the algorithm.
Minor correction here: TF-IDF has nothing to do with clustering. It is simply a method for turning text data into numerical data. It does not care what you do with that data (clustering, classification, regression, search engine things etc.) afterwards.
I understand the message you were trying to get across, but it is incorrect to say that "clustering is done using TF-IDF". It's done using a clustering algorithm, TF-IDF only plays a preprocessing role in document clustering.
For the large matrix produced by the TF-IDF transformation, consider using a sparse matrix.
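For what it is worth, scikit-learn's `TfidfVectorizer` already returns a scipy sparse matrix, and `KMeans` accepts it without conversion. A minimal sketch with made-up documents:

```python
from scipy import sparse
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents for illustration.
docs = [
    "python developer building web services",
    "java developer building backend services",
    "nurse providing patient care",
    "doctor providing hospital patient care",
]

X = TfidfVectorizer().fit_transform(docs)

# Only the non-zero weights are stored, which is what keeps a vocabulary
# of tens of thousands of words manageable in memory.
print(sparse.issparse(X))
print("shape:", X.shape, "stored values:", X.nnz)

# No need to call X.toarray(); the sparse matrix can be clustered as is.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)
```

Calling `toarray()` on a large corpus is what usually causes the memory problems, so it is worth avoiding unless an algorithm strictly requires dense input.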
You could try different k values. I am not an expert in unsupervised clustering algorithms, but I bet with such algorithms and different parameters, you could also end up with a varied number of clusters.
This link might be useful. It provides a good amount of explanation of k-means clustering, with visual output: http://brandonrose.org/clustering
