How to evaluate HDBSCAN text clusters? - python

I'm currently trying to use HDBSCAN to cluster movie data. The goal is to cluster similar movies together (based on movie info like keywords, genres, actor names, etc.) and then apply LDA to each cluster to get the representative topics. However, I'm having a hard time evaluating the results (apart from visual analysis, which doesn't scale well as the data grows). With LDA, although it's hard to evaluate, I've been using the coherence measure. Does anyone have an idea of how to evaluate the clusters produced by HDBSCAN? I haven't been able to find much info on it, so any pointers would be much appreciated!

The hdbscan library implements Density-Based Clustering Validation (DBCV) and exposes it as the relative_validity_ attribute of a fitted clusterer. It lets you compare one clustering, obtained with a given set of hyperparameters, to another one.
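For concreteness, here is a minimal sketch (assuming a feature matrix X is already prepared) of comparing a few min_cluster_size settings via that score; gen_min_span_tree=True is needed so the score can be derived from the minimum spanning tree:
import hdbscan
scores = {}
for mcs in (5, 15, 30):
    clusterer = hdbscan.HDBSCAN(min_cluster_size=mcs, gen_min_span_tree=True)
    clusterer.fit(X)
    # Relative DBCV: higher suggests a better clustering among the candidates tried
    scores[mcs] = clusterer.relative_validity_
best_mcs = max(scores, key=scores.get)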
In general, read about cluster analysis and cluster validation.
Here's a good discussion about this with the author of the HDBSCAN library.

It's the same problem everywhere in unsupervised learning.
Because it is unsupervised, you are trying to discover something new and interesting, and there is no way for the computer to decide whether something is actually interesting or new. It can only decide trivial cases, where the prior knowledge is already coded in machine-processable form and you can compute some heuristic value as a proxy for interestingness. But such measures (including density-based measures such as DBCV) are in no way better at judging this than the clustering algorithm itself is at choosing the "best" solution.
In the end, there is no way around manually looking at the data and taking the next step: try to put what you learned about the data to use. Presumably you are not an ivory-tower academic doing this just to make up yet another method... so use it, don't fake using it.

You can try the clusteval library. It helps you find the optimal number of clusters in your dataset, and it also supports hdbscan. Once you have the cluster labels, you can start an enrichment analysis using hnet.
pip install clusteval
pip install hnet
Example:
# Import library
from clusteval import clusteval
# Set the method
ce = clusteval(method='hdbscan')
# Evaluate
results = ce.fit(X)
# Make plot of the evaluation
ce.plot()
# Make scatter plot using the first two coordinates.
ce.scatter(X)
So at this point you have the optimal detected cluster labels, and now you may want to know whether there is an association between any of the clusters and a (group of) feature(s) in your metadata. The idea is to compute, for each cluster label, how often it is seen for a particular class in your metadata. This is expressed as a P-value: the lower the P-value (below alpha=0.05), the less likely it is to have happened by random chance.
results is a dict and contains the optimal cluster labels in the key labx. With hnet we can compute the enrichment very easily. More information can be found here: https://erdogant.github.io/hnet
# Import library
import hnet
# Get labels
clusterlabels = results['labx']
# Compute the enrichment of the cluster labels with the dataframe df
enrich_results = hnet.enrichment(df, clusterlabels)
When we look at enrich_results, there is a column category_label. These are the metadata variables of the dataframe df that we gave as input. The second column, P, stands for the P-value, which is the computed significance of the category_label with respect to the target variable y. In this case, the target variable y is the cluster labels clusterlabels.
The target labels in y can be significantly enriched more than once. This means that certain y are enriched for multiple variables in the dataframe. This can occur because we may need to better estimate the cluster labels, or because it is a mixed group, or something else.
More information about cluster enrichment can be found here:
https://erdogant.github.io/hnet/pages/html/Use%20Cases.html#cluster-enrichment

Related

Is there a way to constrain a particular data point to a cluster with SciKit K Means or other clustering method?

I am using the scikit-learn KMeans cluster_centers_ attribute to give me a "centre of gravity" for weighted data points. I would like to be able to restrict a subset of data points to a particular cluster and have that affect the centre location. Is there a way to do this by preventing certain values from being attributed to the nearest cluster by default?
You should probably try DBSCAN and see if it does what you want. Here is a link to several clustering algorithms.
https://machinelearningmastery.com/clustering-algorithms-with-python/
Just choose the model you want, fit it, and interpret the output.
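For illustration, a minimal sketch of that fit-and-interpret loop with DBSCAN (the blobs dataset here is just a stand-in for your own features):
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
# Points labelled -1 are left as noise rather than forced into the nearest cluster
print(set(labels))
Note that the noise label is the closest DBSCAN gets to "not assigning" a point; it does not support hard constraints on which cluster a point belongs to.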
Similar to the link above, you can test several clustering models using the sample code from the link below.
https://github.com/ASH-WICUS/Notebooks/blob/master/Clustering%20Algorithms%20Compared.ipynb

How does pairwise comparison training work in XGBoost XGBRanker?

I'm interested in learning to rank with pairwise comparison. While working on this I found that XGBoost has a model called XGBRanker which works very well.
How does XGBRanker manage the training data to get such low memory usage and great results? (It uses LambdaMART, I believe.) I imagine it must use some kind of lookup table for the features, and maybe it builds the pairs iteratively, or does not use all possible permutations with different labels within one group.
I tried looking through the source code but everything keeps referring to some other XGBoost method and I haven't been able to understand it so far.
I would like to create a similar method to train NNs for pairwise comparison but handling the training data has been a huge hurdle so far.
So, more generally, my question would be: how are the pairs created in pairwise ranking algorithms (RankNet, LambdaNet and so on)? Are all pairs used? Only a percentage? Is there some other way of doing this? If you're working with more than 100,000 items, you would easily get into the range of hundreds of millions of pairs.
I hope someone has some information about this or knows who might.

Is there a way in python to identify the exact values from a dataframe with multiple independent variables that generate these outliers?

I trained a model with the xgboost algorithm in Python and plotted the predicted values vs the actual ones on a scatterplot (see image).
As you can see, there are several outliers (I drew a circle around them) which greatly damage the model, and I would like to get rid of them.
Is there a way in python to identify the exact values from a dataframe with multiple independent variables that generate these outliers? [image: predicted vs actual values]
There is something called an anomaly/outlier detection system; you should check that out.
Here is a link
There are several algorithms available in Python for multivariate anomaly detection in sklearn, such as DBSCAN, Isolation Forest, and One-Class SVM, and Isolation Forest is generally deemed to perform well when the dataset has many attributes. However, before using anomaly/outlier detection algorithms, one needs to determine whether these values are actually outliers or whether they are natural behaviour for the dataset. If they are natural, then rather than removing the records one might have to normalize or bin the data, apply other feature engineering techniques, or look at more complex algorithms to fit the data. What if the relation between the target variable and the independent variables is non-linear?
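As a hedged illustration of that approach, a minimal sketch using Isolation Forest to flag multivariate outliers before refitting (X and y stand in for your own feature matrix and target):
from sklearn.ensemble import IsolationForest
iso = IsolationForest(contamination=0.01, random_state=0)
flags = iso.fit_predict(X)  # -1 marks suspected outliers, 1 marks inliers
X_clean, y_clean = X[flags == 1], y[flags == 1]
Whether to drop, bin, or keep the flagged records depends on whether they really are anomalies or just natural behaviour of the data, as noted above.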

Validating document classification procedure using scikit-learn and NLTK (python 3.4) yielding awkward MDS stress

This is my first post on SO so I hope I'm not committing any posting crimes just yet ;-). This is verbose because part of what I am trying to do is to validate my process and ensure I understand how this is done without screwing up majorly. I will sum up my questions here:
How can I have a stress value in the 50's from an MDS? I thought it should be between 0 and 1.
Is running a clustering function on coordinates obtained through MDS a big no-no? I ask because my results do not change significantly when I do so, but that could just be down to my data.
I want to validate my k value for the number of clusters using an "elbow" method. How can I compute this knowing that I rely on linkage() and fcluster() to plot a number of clusters against an error value? Any help on methods or calls to access that data or the data I need to compute it would be greatly appreciated.
I am working on a document classification scheme using python 3.4 for a pet project of mine, where I want to feed in a corpus of several thousand texts and classify them using hierarchical clustering. I also would like to use MDS to graphically represent the cluster structures (I will also use a dendrogram, but I want to give this a shot).
Anyway, first thing I want to do is validate my procedure to make sure I understand how this works. This is done using NLTK and scikit-learn. My objective is not to call one procedure in scikit-learn that would do everything. Rather, I want to compute my similarity matrix (using a procedure in NLTK for example) and then feed that into a clustering function, using the precomputed parameter in some of the methods I rely on.
So my steps are currently as follows:
1. Load the corpus.
2. Clean up corpus items: remove stop words and unwanted chars (numerical values and other text that is not relevant to my objective); use lemmatization (WordNet). The end result is a matrix with n documents and m terms.
3. Compute the similarity between documents: for each document, compute cosine similarity against the matrix of terms. To do that, I use TfidfVectorizer.
Note: I am a python newbie so I may not do things in a pythonic way. I apologize in advance...
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(tokenizer = tokenize, preprocessor = preprocess)
sparse_matrix = vectorizer.fit_transform(term_dict.values())
The tokenizer and preprocessor methods are dummy methods I had to add so that it would not try and tokenize etc. my dictionary which was previously built.
The cosine similarity matrix is built using:
from sklearn.metrics.pairwise import cosine_similarity
from scipy.sparse import coo_matrix

i = 0
return_matrix = [[0 for x in range(len(document_terms_list))] for x in range(len(document_terms_list))]
for index in enumerate(document_terms_list):
    if i < len(document_terms_list):
        similarity = cosine_similarity(sparse_matrix[i:i+1], sparse_matrix)
        M = coo_matrix(similarity)
        # Copy each non-zero similarity value into the dense row for document i
        for k, j, v in zip(M.row, M.col, M.data):
            return_matrix[i][j] = v
        i += 1
So for 100 documents, return_matrix is basically 100 x 100 with each cell having a similarity between Doc_x and Doc_y.
My next step is to perform the clustering (I want to use complete linkage with scipy's hierarchical clustering).
To reduce dimensionality and be able to visualize results, I first perform an MDS on the data:
mds = manifold.MDS(n_components = 2, dissimilarity = "precomputed", verbose = 1)
results = mds.fit(return_matrix)
coordinates = results.embedding_
My problem arises here: calling mds.stress_ reports a value of about 53. I was under the impression that the stress value should be somewhere between 0 and 1. Ahem, needless to say I am speechless at this... This would be my first question. When I print the similarity matrix etc., everything looks relatively good...
To build the clusters, I am currently passing in coordinates to the linkage() and fcluster() functions, i.e. I am passing in the MDS'ed version of my similarity matrix. Now, I wonder if this could be an issue although the results look ok when I look at the clusters assigned to my data. But conceptually, I am not sure this makes sense.
In trying to determine an ideal number of clusters, I want to use an "elbow" method, plotting the variance explained against the number of clusters to have an "ideal" cutoff. I am not sure I see this anywhere in the scikit-learn docs and tutorials. I see places where people do it in R etc. but when I use hierarchical clustering, how can I achieve this? I just don't know where to get the data from the API and what data I am looking for exactly.
Many thanks in advance. I apologize for the length of this post but I figured giving out some context might help.
Cheers,
Greg

Python: feature selection in sci-kit learn for a normal distribution

I have a pandas DataFrame whose index is unique user identifiers, columns corresponding to unique events, and values 1 (attended), 0 (did not attend), or NaN (wasn't invited/not relevant). The matrix is pretty sparse with respect to NaNs: there are several hundred events and most users were only invited to several tens at most.
I created some extra columns to measure the "success" which I define as just % attended relative to invites:
my_data['invited'] = my_data.count(axis=1)
my_data['attended'] = my_data.sum(axis=1)-my_data['invited']
my_data['success'] = my_data['attended']/my_data['invited']
Assume the following is true: the success data should be normally distributed with mean 0.80 and s.d. 0.10. When I look at the histogram of my_data['success'], it is not normal and is skewed left. It is not important whether this is true in reality; I just want to solve the technical problem I pose below.
So this is my problem: there are some events which I don't think are "good" in the sense that they are making the success data diverge from normal. I'd like to do "feature selection" on my events to pick a subset of them which makes the distribution of my_data['success'] as close to normal as possible in the sense of "convergence in distribution".
I looked at the scikit-learn "feature selection" methods here and the "Univariate feature selection" seems like it makes sense. But I'm very new to both pandas and scikit-learn and could really use help on how to actually implement this in code.
Constraints: I need to keep at least half the original events.
Any help would be greatly appreciated. Please share as many details as you can, I am very new to these libraries and would love to see how to do this with my DataFrame.
Thanks!
EDIT: After looking some more at the scikit-learn feature selection approaches, recursive feature elimination seems like it might make sense here too, but I'm not sure how to build it up with my "accuracy" metric being "close to normally distributed with mean..."
Keep in mind that feature selection is to select features, not samples, i.e., (typically) the columns of your DataFrame, not the rows. So, I am not sure if feature selection is what you want: I understand that you want to remove those samples that cause the skew in your distribution?
Also, what about feature scaling, e.g., standardization, so that your data becomes normally distributed with mean=0 and sd=1?
The equation is simply z = (x - mean) / sd
To apply it to your DataFrame, you can simply do
my_data['success'] = (my_data['success'] - my_data['success'].mean(axis=0)) / (my_data['success'].std(axis=0))
However, don't forget to keep the mean and SD parameters to transform your test data, too. Alternatively, you could also use the StandardScaler from scikit-learn.
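For completeness, a small sketch of that StandardScaler alternative (test_data is a hypothetical hold-out frame with the same column):
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
my_data[['success']] = scaler.fit_transform(my_data[['success']])
# Reuse the mean/SD learned on the training data when scaling the test data
test_data[['success']] = scaler.transform(test_data[['success']])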
