Document Clustering in python using SciKit

Document Clustering in python using SciKit - python

I recently started working on Document clustering using SciKit module in python. However I am having a hard time understanding the basics of document clustering.
What I know ?
Document clustering is typically done using TF/IDF. Which essentially
converts the words in the documents to vector space model which is
then input to the algorithm.
There are many algorithms like k-means, neural networks, hierarchical
clustering to accomplish this.
My Data :
I am experimenting with linkedin data, each document would be the
linkedin profile summary, I would like to see if similar job
documents get clustered together.
Current Challenges:
My data has huge summary descriptions, which end up becoming 10000's
of words when I apply TF/IDF. Is there any proper way to handle this
high dimensional data.
K - means and other algorithms requires I specify the no. of clusters
( centroids ), in my case I do not know the number of clusters
upfront. This I believe is a completely unsupervised learning. Are
there algorithms which can determine the no. of clusters themselves?
I've never worked with document clustering before, if you are aware
of tutorials , textbooks or articles which address this issue, please
feel free to suggest.
I went through the code on SciKit webpage, it consists of too many technical words which I donot understand, if you guys have any code with good explanation or comments please share. Thanks in advance.

My data has huge summary descriptions, which end up becoming 10000's of words when I apply TF/IDF. Is there any proper way to handle this high dimensional data.
My first suggestion is that you don't unless you absolutely have to, due to memory or execution time problems.
If you must handle it, you should use dimensionality reduction (PCA for example) or feature selection (probably better in your case, see chi2 for example)
K - means and other algorithms requires I specify the no. of clusters ( centroids ), in my case I do not know the number of clusters upfront. This I believe is a completely unsupervised learning. Are there algorithms which can determine the no. of clusters themselves?
If you look at the clustering algorithms available in scikit-learn, you'll see that not all of them require that you specify the number of clusters.
Another one that does not is hierarchical clustering, implemented in scipy. Also see this answer.
I would also suggest that you use KMeans and try to manually tweak the number of clusters until you are satisfied with the results.
I've never worked with document clustering before, if you are aware of tutorials , textbooks or articles which address this issue, please feel free to suggest.
Scikit has a lot of tutorials for working with text data, just use the "text data" search query on their site. One is for KMeans, others are for supervised learning, but I suggest you go over those too to get more familiar with the library. From a coding, style and syntax POV, unsupervised and supervised learning are pretty similar in scikit-learn, in my opinion.
Document clustering is typically done using TF/IDF. Which essentially converts the words in the documents to vector space model which is then input to the algorithm.
Minor correction here: TF-IDF has nothing to do with clustering. It is simply a method for turning text data into numerical data. It does not care what you do with that data (clustering, classification, regression, search engine things etc.) afterwards.
I understand the message you were trying to get across, but it is incorrect to say that "clustering is done using TF-IDF". It's done using a clustering algorithm, TF-IDF only plays a preprocessing role in document clustering.

For the large matrix after TF/IDF transformation, consider using sparse matrix.
You could try different k values. I am not an expert in unsupervised clustering algorithms, but I bet with such algorithms and different parameters, you could also end up with a varied number of clusters.

This link might be useful. It provides good amount of explanation for k-means clustering with a visual output http://brandonrose.org/clustering

Related

Clustering text data based on sentiment?

I am scraping reviews off Amazon with the intent to perform sentiment analysis to classify them into positive, negative and neutral. Now the data I would get would be text and unlabeled.
My approach to this problem would be as following:-
1.) Label the data using clustering algorithms like DBScan, HDBScan or KMeans. The number of clusters would obviously be 3.
2.) Train a Classification algorithm on the labelled data.
Now I have never performed clustering on text data but I am familiar with the basics of clustering. So my question is:
Is my approach correct?
Any articles/blogs/tutorials I can follow for text based clustering since I am kinda new to this?

I have never done such an experiment but as far as I know, the most challenging part of this work is transforming the sentences or documents into fixed-length vectors (mapping into semantic space). I highly suggest using a sentiment analysis pipeline from huggingface library for embedding the sentences (in this way you might exploit some supervision). There are other options as well:
Using sentence-transformers library. (straightforward and still good)
Using BoW. (simplest way but hard to get what you want)
Using TF-IDF (still simple but may simply do the work)
After you reach this point (every review ==> fixed-length vector) you can exploit whatever you want to cluster them and look after the results.

Which clustering method is the standard way to go for text analytics?

Assume you have lot of text sentences which may have (or not) similarities. Now you want to cluster similar sentences for finding centroids of each cluster. Which method is the prefered way for doing this kind of clustering? K-means with TF-IDF sounds promising. Nevertheless, are there more sophisticated algorithms or better ones? Data structure is tokenized and in a one-hot encoded format.

Basically you can cluster texts using different techniques. As you pointed out, K-means with TF-IDF is one of the ways to do this. Unfortunately, only using tf-idf won't be able to "detect" semantics and to project smantically similar texts near one another in the space. However, instead of using tf-idf, you can use word embeddings, such as word2vec or glove - there is a lot of information on the net about them, just google it. Have you ever heard of topic models? Latent Dirichlet allocation (LDA) is a topic model and it observes each document as a mixture of a small number of topics and that each word's presence is attributable to one of the document's topics (see the wikipedia link). So, basically, using a topic model you can also do some kind of grouping and assign similar texts (with a similar topic) to groups. I recommend you to read about topic models, since they are more common for such problems connected with text clustering.
I hope my answer was helpful.

in my view, You can use LDA(latent Dirichlet allocation, it is more flexible in comparison to other clustering techniques because of having Alpha and Beta vectors that can adjust to the contribution of each topic in a document and word in a topic. It can help if the documents are not of similar length or quality.

Tweet classification into multiple categories on (Unsupervised data/tweets)

I want to classify the tweets into predefined categories (like: sports, health, and 10 more). If I had labeled data, I would be able to do the classification by training Naive Bayes or SVM. As described in http://cucis.ece.northwestern.edu/publications/pdf/LeePal11.pdf
But I cannot figure out a way with unlabeled data. One possibility could be using Expectation-Maximization and generating clusters and label those clusters. But as said earlier I have predefined set of classes, so clustering won't be as good.
Can anyone guide me on what techniques I should follow. Appreciate any help.

Alright by what i can understand i think there are multiple ways to attend to this case.
there will be trade offs and the accuracy rate may vary. because of the well know fact and observation
Each single tweet is distinct!
(unless you are extracting data from twitter stream api based on tags and other keywords). Please define the source of data and how are you extracting it. i am assuming you're just getting general tweets which can be about anything
The thing you can do is to generate a set of dictionary for each class you have
(i.e Music => pop , jazz , rap , instruments ...)
which will contain relevant words to that class. You can use NLTK for python or Stanford NLP for other languages.
You can start with extracting
Synonyms
Hyponyms
Hypernyms
Meronyms
Holonyms
Go see these NLP Lexical semantics slides. it will surely clear some of the concepts.
Once you have dictionaries for each classes. cross compare them with the tweets you have got. the tweet which has the most similarity (you can rank them according to the occurrences of words from the these dictionaries) you can label it to that class. This will make your tweets labeled like others.
Now the question is the accuracy! But it depends on the data and versatility of your classes. This may be an "Over kill" But it may come close to what you want.
Furthermore you can label some set of tweets this way and use Cosine Similarity to cross identify other tweets. This will help with the optimization part. But then again its up-to you. As you know what Trade offs you can bear
The real struggle will be the machine learning part and how you manage that.

Actually this seems as a typical use case of semi-supervised learning. There are plenty methods of use here, including clustering with constraints (where you force model to cluster samples from the same class together), transductive learning (where you try to extrapolate model from labeled samples onto distribution of unlabeled ones).
You could also simply cluster data as #Shoaib suggested, but then you will have to come up the the heuristic approach how to deal with clusters with mixed labeling. Futhermore - obviously solving optimziation problem not related to the task (labeling) will not be as good as actually using this knowledge.

You can use clustering for that task. For that you have to label some examples for each class first. Then using these labeled examples, you can identify the class of each cluster easily.

Reverse-engineering a clustering algorithm from the clusters

I have a clustering of data performed by a human based solely on their knowledge of the system. I also have a feature vector for each element. I have no knowledge about the meaning of the features, nor do I know what the reasoning behind the human clustering was.
I have complete information about which elements belong to which cluster. I can assume that the human was not stupid and there is a way to derive the clustering from the features.
Is there an intelligent way to reverse-engineer the clustering? That is, how can I select the features and the clustering algorithm that will yield the same clustering most of the time (on this data set)?
So far I have tried the naive approach - going through the clustering algorithms provided by the sklearn library in python and comparing the obtained clusters to the source one. This approach does not yield good results.
My next approach would be to use some linear combinations of the features, or subsets of features. Here, again, my question is if there is a more intelligent way to do this than to go through as many combinations as possible.
I can't shake the feeling that this is a standard problem and I'm just missing the right term to find the solution on Google.

Are you sure it was done automatically?
It sounds to me as if you should be treating this as a classification problem: construct a classifier that does the same as the human did.

Feature Selection and Reduction for Text Classification

I am currently working on a project, a simple sentiment analyzer such that there will be 2 and 3 classes in separate cases. I am using a corpus that is pretty rich in the means of unique words (around 200.000). I used bag-of-words method for feature selection and to reduce the number of unique features, an elimination is done due to a threshold value of frequency of occurrence. The final set of features includes around 20.000 features, which is actually a 90% decrease, but not enough for intended accuracy of test-prediction. I am using LibSVM and SVM-light in turn for training and prediction (both linear and RBF kernel) and also Python and Bash in general.
The highest accuracy observed so far is around 75% and I need at least 90%. This is the case for binary classification. For multi-class training, the accuracy falls to ~60%. I need at least 90% at both cases and can not figure how to increase it: via optimizing training parameters or via optimizing feature selection?
I have read articles about feature selection in text classification and what I found is that three different methods are used, which have actually a clear correlation among each other. These methods are as follows:
Frequency approach of bag-of-words (BOW)
Information Gain (IG)
X^2 Statistic (CHI)
The first method is already the one I use, but I use it very simply and need guidance for a better use of it in order to obtain high enough accuracy. I am also lacking knowledge about practical implementations of IG and CHI and looking for any help to guide me in that way.
Thanks a lot, and if you need any additional info for help, just let me know.
#larsmans: Frequency Threshold: I am looking for the occurrences of unique words in examples, such that if a word is occurring in different examples frequently enough, it is included in the feature set as a unique feature.
#TheManWithNoName: First of all thanks for your effort in explaining the general concerns of document classification. I examined and experimented all the methods you bring forward and others. I found Proportional Difference (PD) method the best for feature selection, where features are uni-grams and Term Presence (TP) for the weighting (I didn't understand why you tagged Term-Frequency-Inverse-Document-Frequency (TF-IDF) as an indexing method, I rather consider it as a feature weighting approach). Pre-processing is also an important aspect for this task as you mentioned. I used certain types of string elimination for refining the data as well as morphological parsing and stemming. Also note that I am working on Turkish, which has different characteristics compared to English. Finally, I managed to reach ~88% accuracy (f-measure) for binary classification and ~84% for multi-class. These values are solid proofs of the success of the model I used. This is what I have done so far. Now working on clustering and reduction models, have tried LDA and LSI and moving on to moVMF and maybe spherical models (LDA + moVMF), which seems to work better on corpus those have objective nature, like news corpus. If you have any information and guidance on these issues, I will appreciate. I need info especially to setup an interface (python oriented, open-source) between feature space dimension reduction methods (LDA, LSI, moVMF etc.) and clustering methods (k-means, hierarchical etc.).

This is probably a bit late to the table, but...
As Bee points out and you are already aware, the use of SVM as a classifier is wasted if you have already lost the information in the stages prior to classification. However, the process of text classification requires much more that just a couple of stages and each stage has significant effects on the result. Therefore, before looking into more complicated feature selection measures there are a number of much simpler possibilities that will typically require much lower resource consumption.
Do you pre-process the documents before performing tokensiation/representation into the bag-of-words format? Simply removing stop words or punctuation may improve accuracy considerably.
Have you considered altering your bag-of-words representation to use, for example, word pairs or n-grams instead? You may find that you have more dimensions to begin with but that they condense down a lot further and contain more useful information.
Its also worth noting that dimension reduction is feature selection/feature extraction. The difference is that feature selection reduces the dimensions in a univariate manner, i.e. it removes terms on an individual basis as they currently appear without altering them, whereas feature extraction (which I think Ben Allison is referring to) is multivaritate, combining one or more single terms together to produce higher orthangonal terms that (hopefully) contain more information and reduce the feature space.
Regarding your use of document frequency, are you merely using the probability/percentage of documents that contain a term or are you using the term densities found within the documents? If category one has only 10 douments and they each contain a term once, then category one is indeed associated with the document. However, if category two has only 10 documents that each contain the same term a hundred times each, then obviously category two has a much higher relation to that term than category one. If term densities are not taken into account this information is lost and the fewer categories you have the more impact this loss with have. On a similar note, it is not always prudent to only retain terms that have high frequencies, as they may not actually be providing any useful information. For example if a term appears a hundred times in every document, then it is considered a noise term and, while it looks important, there is no practical value in keeping it in your feature set.
Also how do you index the data, are you using the Vector Space Model with simple boolean indexing or a more complicated measure such as TF-IDF? Considering the low number of categories in your scenario a more complex measure will be beneficial as they can account for term importance for each category in relation to its importance throughout the entire dataset.
Personally I would experiment with some of the above possibilities first and then consider tweaking the feature selection/extraction with a (or a combination of) complex equations if you need an additional performance boost.
Additional
Based on the new information, it sounds as though you are on the right track and 84%+ accuracy (F1 or BEP - precision and recall based for multi-class problems) is generally considered very good for most datasets. It might be that you have successfully acquired all information rich features from the data already, or that a few are still being pruned.
Having said that, something that can be used as a predictor of how good aggressive dimension reduction may be for a particular dataset is 'Outlier Count' analysis, which uses the decline of Information Gain in outlying features to determine how likely it is that information will be lost during feature selection. You can use it on the raw and/or processed data to give an estimate of how aggressively you should aim to prune features (or unprune them as the case may be). A paper describing it can be found here:
Paper with Outlier Count information
With regards to describing TF-IDF as an indexing method, you are correct in it being a feature weighting measure, but I consider it to be used mostly as part of the indexing process (though it can also be used for dimension reduction). The reasoning for this is that some measures are better aimed toward feature selection/extraction, while others are preferable for feature weighting specifically in your document vectors (i.e. the indexed data). This is generally due to dimension reduction measures being determined on a per category basis, whereas index weighting measures tend to be more document orientated to give superior vector representation.
In respect to LDA, LSI and moVMF, I'm afraid I have too little experience of them to provide any guidance. Unfortunately I've also not worked with Turkish datasets or the python language.

I would recommend dimensionality reduction instead of feature selection. Consider either singular value decomposition, principal component analysis, or even better considering it's tailored for bag-of-words representations, Latent Dirichlet Allocation. This will allow you to notionally retain representations that include all words, but to collapse them to fewer dimensions by exploiting similarity (or even synonymy-type) relations between them.
All these methods have fairly standard implementations that you can get access to and run---if you let us know which language you're using, I or someone else will be able to point you in the right direction.

There's a python library for feature selection
TextFeatureSelection. This library provides discriminatory power in the form of score for each word token, bigram, trigram etc.
Those who are aware of feature selection methods in machine learning, it is based on filter method and provides ML engineers required tools to improve the classification accuracy in their NLP and deep learning models. It has 4 methods namely Chi-square, Mutual information, Proportional difference and Information gain to help select words as features before being fed into machine learning classifiers.
from TextFeatureSelection import TextFeatureSelection
#Multiclass classification problem
input_doc_list=['i am very happy','i just had an awesome weekend','this is a very difficult terrain to trek. i wish i stayed back at home.','i just had lunch','Do you want chips?']
target=['Positive','Positive','Negative','Neutral','Neutral']
fsOBJ=TextFeatureSelection(target=target,input_doc_list=input_doc_list)
result_df=fsOBJ.getScore()
print(result_df)
#Binary classification
input_doc_list=['i am content with this location','i am having the time of my life','you cannot learn machine learning without linear algebra','i want to go to mars']
target=[1,1,0,1]
fsOBJ=TextFeatureSelection(target=target,input_doc_list=input_doc_list)
result_df=fsOBJ.getScore()
print(result_df)
Edit:
It now has genetic algorithm for feature selection as well.
from TextFeatureSelection import TextFeatureSelectionGA
#Input documents: doc_list
#Input labels: label_list
getGAobj=TextFeatureSelectionGA(percentage_of_token=60)
best_vocabulary=getGAobj.getGeneticFeatures(doc_list=doc_list,label_list=label_list)
Edit2
There is another method nowTextFeatureSelectionEnsemble, which combines feature selection while ensembling. It does feature selection for base models through document frequency thresholds. At ensemble layer, it uses genetic algorithm to identify best combination of base models and keeps only those.
from TextFeatureSelection import TextFeatureSelectionEnsemble
imdb_data=pd.read_csv('../input/IMDB Dataset.csv')
le = LabelEncoder()
imdb_data['labels'] = le.fit_transform(imdb_data['sentiment'].values)
#convert raw text and labels to python list
doc_list=imdb_data['review'].tolist()
label_list=imdb_data['labels'].tolist()
#Initialize parameter for TextFeatureSelectionEnsemble and start training
gaObj=TextFeatureSelectionEnsemble(doc_list,label_list,n_crossvalidation=2,pickle_path='/home/user/folder/',average='micro',base_model_list=['LogisticRegression','RandomForestClassifier','ExtraTreesClassifier','KNeighborsClassifier'])
best_columns=gaObj.doTFSE()`
Check the project for details: https://pypi.org/project/TextFeatureSelection/

Linear svm is recommended for high dimensional features. Based on my experience the ultimate limitation of SVM accuracy depends on the positive and negative "features". You can do a grid search (or in the case of linear svm you can just search for the best cost value) to find the optimal parameters for maximum accuracy, but in the end you are limited by the separability of your feature-sets. The fact that you are not getting 90% means that you still have some work to do finding better features to describe your members of the classes.

I'm sure this is way too late to be of use to the poster, but perhaps it will be useful to someone else. The chi-squared approach to feature reduction is pretty simple to implement. Assuming BoW binary classification into classes C1 and C2, for each feature f in candidate_features calculate the freq of f in C1; calculate total words C1; repeat calculations for C2; Calculate a chi-sqaure determine filter candidate_features based on whether p-value is below a certain threshold (e.g. p < 0.05). A tutorial using Python and nltk can been seen here: http://streamhacker.com/2010/06/16/text-classification-sentiment-analysis-eliminate-low-information-features/ (though if I remember correctly, I believe the author incorrectly applies this technique to his test data, which biases the reported results).

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.