Reverse-engineering a clustering algorithm from the clusters

Reverse-engineering a clustering algorithm from the clusters - python

I have a clustering of data performed by a human based solely on their knowledge of the system. I also have a feature vector for each element. I have no knowledge about the meaning of the features, nor do I know what the reasoning behind the human clustering was.
I have complete information about which elements belong to which cluster. I can assume that the human was not stupid and there is a way to derive the clustering from the features.
Is there an intelligent way to reverse-engineer the clustering? That is, how can I select the features and the clustering algorithm that will yield the same clustering most of the time (on this data set)?
So far I have tried the naive approach - going through the clustering algorithms provided by the sklearn library in python and comparing the obtained clusters to the source one. This approach does not yield good results.
My next approach would be to use some linear combinations of the features, or subsets of features. Here, again, my question is if there is a more intelligent way to do this than to go through as many combinations as possible.
I can't shake the feeling that this is a standard problem and I'm just missing the right term to find the solution on Google.

Are you sure it was done automatically?
It sounds to me as if you should be treating this as a classification problem: construct a classifier that does the same as the human did.

Related

Data Science: What is the best way to figure out the correlation between multiple characteristics and a performance score?

I am very new to data science, so I have a (basic?) question:
I have a set of materials (let's say plastics, glass, concrete…). I have a bunch of characteristics of each material (e.g. toughness, translucency) and for each of these materials I also have a score how they perform in a certain test.
Now I want to find out if there is some kind of correlation between the characteristics and the performance score. There is no linear correlation, I assume that it some combination of some (but not all) of the characteristics.
How do I go about finding out how they are "connected" ? What are the best methods? I was thinking of training a neural network but I don't have that much data and also, it seems like a bit of an overkill.
As I said I am very new to this so I am grateful for any hint or term I need to search for (I work with Python, btw).

To find correlation between characteristics of your materials and their performance in a certain test,you can try to use machine learning algorithms like a decision tree or a random forest. its simple to use with even less data.
you need to experiments different algorithms to find best approach that works for you.

Why random_state parameter is used in NMF and LDA algorithm ? What are the benefits of using random topics generated every time?

For Topic Modelling ,
Why random_state parameter is used in NMF and LDA algorithm ? What are the benefits of using random topics generated every time ?

The algorithms for both are stochastic - meaning they use randomness as a part of estimating a good answer. It's done that way to make it tractable, and in the case of LDA, the whole model is stochastic, providing you ideally with a probabilistic distribution (called "the posterior distribution") of answers, but instead providing a single, likely answer as an estimate.
So the answer is that using randomness in the algorithms makes a tremendously difficult problem much simpler and feasible to calculate in less than a hundred years.
If you're going to use them, I think it would do you well to study them, learn something of how they work, why they work. Using a tool that you don't understand is risky, as you don't really know what the result the tool provides actually means. One example is the numerors words in all "topics" with very low probability. The differences in these probabilities are actually meaningless - given a different sample from the posterior, you'd get different probabilities, ranked differently between words.

Predicting nucleotide sequence efficiency

I am new to machine learning and I am wondering whether it would be possible to use my available biological data for clustering. I want to find out whether a group of DNA sequences can be clustered into two groups, efficient and not efficient.
I have five sets, each containing about 480 short sequences (lets call them samples). Each set is having an effect with different strength:
Set1 - Very good effect
Set2 - Good effect
Set3 - Minor effect
Set4 - Very minor effect
Set5 - No effect
Each sample has some features, e.g. free energy,starting with a specific nucleotide...
Now my question is whether I can find out which type of sample in my sets are playing a role for the effect of the whole set. My only assumption is that in set1 I have more efficient samples then in set5 (either none or very few). A very simple (not realistic) result could be, all samples which start with nucleotide 'A' end end with nucleotide 'C' are causing the effect.
Is it possible to use machine learning to find out?
Thanks!

That definitely sounds like a problem where machine learning could give good results. I recommend that you look into scikit-learn, a powerful and easy to use toolkit for machine learning in Python. There are many introductory examples and tutorials available.
For your use case, I would say that random forests could give good results, although it's hard to say without knowing more about the structure of the data. They are available in the class RandomForestClassifier in sklearn. Again, there are many tutorials and examples to be found.
Since your training data is unlabeled, you may want to look into unsupervised learning methods. A simple class of such methods are clustering algorithms. In sklearn, you can find, for instance, k-means clustering along other such algorithms. The idea would be to let the algorithm split your data into different cluster and see if there is any correlation between cluster membership and observed effect.

It is unclear from your description what the 5 sets (what sound like labels) correspond to, but I will assume that you are essentially asking about feature learning: you would like to know which features to choose to best predict what set a given sequence is from. Determining this from scratch is an open problem in machine learning and there are many possible approaches depending on the particulars of your situation.
You can select a set of features (just by making logical guesses) and calculate them for all sequences, then perform PCA on all the vectors you have generated. PCA will give you the linear combination of features that accounts for the most variability in your data which is useful in designing meaningful features.

Document Clustering in python using SciKit

I recently started working on Document clustering using SciKit module in python. However I am having a hard time understanding the basics of document clustering.
What I know ?
Document clustering is typically done using TF/IDF. Which essentially
converts the words in the documents to vector space model which is
then input to the algorithm.
There are many algorithms like k-means, neural networks, hierarchical
clustering to accomplish this.
My Data :
I am experimenting with linkedin data, each document would be the
linkedin profile summary, I would like to see if similar job
documents get clustered together.
Current Challenges:
My data has huge summary descriptions, which end up becoming 10000's
of words when I apply TF/IDF. Is there any proper way to handle this
high dimensional data.
K - means and other algorithms requires I specify the no. of clusters
( centroids ), in my case I do not know the number of clusters
upfront. This I believe is a completely unsupervised learning. Are
there algorithms which can determine the no. of clusters themselves?
I've never worked with document clustering before, if you are aware
of tutorials , textbooks or articles which address this issue, please
feel free to suggest.
I went through the code on SciKit webpage, it consists of too many technical words which I donot understand, if you guys have any code with good explanation or comments please share. Thanks in advance.

My data has huge summary descriptions, which end up becoming 10000's of words when I apply TF/IDF. Is there any proper way to handle this high dimensional data.
My first suggestion is that you don't unless you absolutely have to, due to memory or execution time problems.
If you must handle it, you should use dimensionality reduction (PCA for example) or feature selection (probably better in your case, see chi2 for example)
K - means and other algorithms requires I specify the no. of clusters ( centroids ), in my case I do not know the number of clusters upfront. This I believe is a completely unsupervised learning. Are there algorithms which can determine the no. of clusters themselves?
If you look at the clustering algorithms available in scikit-learn, you'll see that not all of them require that you specify the number of clusters.
Another one that does not is hierarchical clustering, implemented in scipy. Also see this answer.
I would also suggest that you use KMeans and try to manually tweak the number of clusters until you are satisfied with the results.
I've never worked with document clustering before, if you are aware of tutorials , textbooks or articles which address this issue, please feel free to suggest.
Scikit has a lot of tutorials for working with text data, just use the "text data" search query on their site. One is for KMeans, others are for supervised learning, but I suggest you go over those too to get more familiar with the library. From a coding, style and syntax POV, unsupervised and supervised learning are pretty similar in scikit-learn, in my opinion.
Document clustering is typically done using TF/IDF. Which essentially converts the words in the documents to vector space model which is then input to the algorithm.
Minor correction here: TF-IDF has nothing to do with clustering. It is simply a method for turning text data into numerical data. It does not care what you do with that data (clustering, classification, regression, search engine things etc.) afterwards.
I understand the message you were trying to get across, but it is incorrect to say that "clustering is done using TF-IDF". It's done using a clustering algorithm, TF-IDF only plays a preprocessing role in document clustering.

For the large matrix after TF/IDF transformation, consider using sparse matrix.
You could try different k values. I am not an expert in unsupervised clustering algorithms, but I bet with such algorithms and different parameters, you could also end up with a varied number of clusters.

This link might be useful. It provides good amount of explanation for k-means clustering with a visual output http://brandonrose.org/clustering

Python KMeans clustering words

I am interested to perform kmeans clustering on a list of words with the distance measure being Leveshtein.
1) I know there are a lot of frameworks out there, including scipy and orange that has a kmeans implementation. However they all require some sort of vector as the data which doesn't really fit me.
2) I need a good clustering implementation. I looked at python-clustering and realize that it doesn't a) return the sum of all the distance to each centroid, and b) it doesn't have any sort of iteration limit or cut off which ensures the quality of the clustering. python-clustering and the clustering algorithm on daniweb doesn't really work for me.
Can someone find me a good lib? Google hasn't been my friend

Yeah I think there isn't a good implementation to what I need.
I have some crazy requirements, like distance caching etc.
So i think i will just write my own lib and release it as GPLv3 soon.

Not really an answer to your specific question, but I recommend glancing at "Programming Collective Intelligence". At the end of each chapter, e.g., clustering, it wanders off into describing all the best reading on the subject.

Maybe have a look at Weka. It is a Java library with some unsupervised learning implementations and nice visualization tools. It has been a while since I used it, not sure if it is great for a real production environment but defenitely a good starting point.

What about this very nice answer on CrossValidated?
It uses Affinity Propagation instead of k-means and in that case you can give as input a distance metric. I do not think any k-means based approach could work in your case since it is based on building a centroid and in order to do that you have to be in a vector space.
Affinity Propagation has the bonus that it selects automatically the number of clusters, which you can tweak (to have more or less clusters) by altering the preference (which by default is the median of all pairwise distance, but you can choose other percentiles).
If you need to specify the exact number of clusters, besides tweaking Affinity Propagation by trial and error, you could look for implementation of k-medoids (apparently there is no implementation of it in sklearn, but people have asked for it here and there). K-medoids does not build centroids, so it does not need the concept of vector space. So implementation might accept as input a precomputed distance matrix (haven't checked the references I give, though).

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.