I am working on a problem where I have text data with around 10,000 documents. I have created an app where, if a user enters some random comment, it should display all the similar comments/documents present in the training data.
Exactly like on Stack Overflow: if you ask a question, it shows all related questions asked earlier.
So if anyone has any suggestions on how to do this, please answer.
Second, I am trying the LDA (Latent Dirichlet Allocation) algorithm, where I can get the topic to which my new document belongs, but how will I get the similar documents from the training data? Also, how should I choose num_topics in LDA?
If anyone has suggestions for algorithms other than LDA, please tell me.
You can try the following:
Doc2vec - this is an extension of the extremely popular word2vec algorithm, which maps words to an N-dimensional vector space such that words that occur in close proximity in your documents end up in close proximity in the vector space. You can use pre-trained word embeddings; learn more about word2vec here. Doc2vec extends this idea to map each document to a vector of dimension N. After this, you can use any distance measure (e.g. cosine similarity) to find the most similar documents to an input document.
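For example, here is a minimal gensim Doc2Vec sketch (the toy documents and parameters are just illustrative; `model.dv` is the gensim 4.x attribute, older versions call it `model.docvecs`):

```python
# Minimal Doc2Vec sketch: train on the corpus, then find documents
# most similar to a newly entered comment.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = ["how do I parse json in python",
        "best way to parse json with python",
        "how to sort a list of dictionaries"]      # stand-ins for your 10,000 documents
tagged = [TaggedDocument(words=d.split(), tags=[i]) for i, d in enumerate(docs)]

model = Doc2Vec(vector_size=50, min_count=1, epochs=40)  # tune for your real data
model.build_vocab(tagged)
model.train(tagged, total_examples=model.corpus_count, epochs=model.epochs)

# Infer a vector for the user's new comment and rank training documents by similarity
new_vec = model.infer_vector("parsing json in python".split())
print(model.dv.most_similar([new_vec], topn=3))
```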
Word Mover's Distance - this is directly suited to your purpose and also uses word embeddings. I have used it in one of my personal projects and achieved really good results. Find out more about it here.
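A quick gensim sketch of Word Mover's Distance, assuming pre-trained vectors fetched via `gensim.downloader` (older gensim versions also need the `pyemd` package for `wmdistance`):

```python
# Word Mover's Distance between two tokenized documents; lower = more similar.
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")   # any pre-trained KeyedVectors will do

doc_a = "the cat sits on the mat".split()
doc_b = "a kitten is resting on the rug".split()

print(wv.wmdistance(doc_a, doc_b))
```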
Also, make sure you apply appropriate text cleaning before applying these algorithms: steps like case normalization, stopword removal, and punctuation removal. What is appropriate really depends on your dataset. Find out more here.
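For instance, a rough cleaning pass might look like this (adjust the stopword list and steps to your own dataset; the nltk stopword corpus has to be downloaded once):

```python
# Basic text cleaning: case normalization, punctuation removal, stopword removal.
import string
from nltk.corpus import stopwords   # requires nltk.download("stopwords") once

stop_words = set(stopwords.words("english"))

def clean(text):
    text = text.lower()                                                # case normalization
    text = text.translate(str.maketrans("", "", string.punctuation))   # punctuation removal
    return [w for w in text.split() if w not in stop_words]            # stopword removal

print(clean("This is an Example, with SOME punctuation!"))
```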
I hope this was helpful...
Related
First, I want to explain my task. I have a dataset of 300k documents with an average of 560 words each (no stop word removal yet); 75% are in German, 15% in English and the rest in other languages. The goal is to recommend similar documents based on an existing one. To begin with, I want to focus on the German and English documents.
To achieve this goal I looked into several feature extraction methods for document similarity. The word embedding methods especially impressed me, because they are context aware, in contrast to simple TF-IDF feature extraction followed by cosine similarity.
I'm overwhelmed by the number of methods I could use, and I haven't found a proper evaluation of them yet. I know for sure that my documents are too long for BERT, but there are FastText, Sent2Vec, Doc2Vec and the Universal Sentence Encoder from Google. Based on my research, my favorite method is Doc2Vec, even though the pre-trained models are either missing or outdated, which means I would have to do the training on my own.
Now that you know my task and goal, I have the following questions:
Which method should I use for feature extraction based on the rough overview of my data?
My dataset is too small to train Doc2Vec on. Would I achieve good results if I trained the model on the English/German Wikipedia instead?
You really have to try the different methods on your own data, with your specific user tasks and your time/resources budget, to know which makes sense.
Your 225k German documents and 45k English documents are each plausibly large enough to use Doc2Vec, as they match or exceed the corpus sizes in some published results. So you wouldn't necessarily need to train on something else (like Wikipedia) instead, and whether adding that to your data would help or hurt is another thing you'd need to determine experimentally.
(There might be special challenges in German, given compound words built from common-enough roots that are individually rare - I'm not sure. FastText-based approaches that use word fragments might be helpful, but I don't know of a Doc2Vec-like algorithm that uses that same character-n-grams trick. The closest that might be possible is to use Facebook FastText's supervised mode, with a rich set of meaningful known labels to bootstrap better text vectors - but that's highly speculative, and that mode isn't supported in Gensim.)
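If you do experiment with training on your own corpus, one possible setup is a separate gensim Doc2Vec model per language, roughly like this (parameters and variable names are just illustrative):

```python
# Sketch: train one Doc2Vec model per language on already-tokenized documents.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

def train_doc2vec(token_lists, vector_size=300, epochs=20):
    tagged = [TaggedDocument(words=toks, tags=[i]) for i, toks in enumerate(token_lists)]
    model = Doc2Vec(vector_size=vector_size, min_count=5, workers=4, epochs=epochs)
    model.build_vocab(tagged)
    model.train(tagged, total_examples=model.corpus_count, epochs=model.epochs)
    return model

# german_tokens / english_tokens: lists of token lists (~225k and ~45k documents)
# model_de = train_doc2vec(german_tokens)
# model_en = train_doc2vec(english_tokens)
```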
Assume you have a lot of text sentences which may (or may not) be similar to one another. Now you want to cluster the similar sentences and find the centroid of each cluster. Which method is the preferred way of doing this kind of clustering? K-means with TF-IDF sounds promising. Nevertheless, are there more sophisticated or better algorithms? The data is tokenized and in a one-hot encoded format.
Basically, you can cluster texts using different techniques. As you pointed out, K-means with TF-IDF is one way to do it. Unfortunately, tf-idf alone won't be able to "detect" semantics and project semantically similar texts near one another in the vector space. However, instead of tf-idf you can use word embeddings, such as word2vec or GloVe - there is a lot of information about them on the net, just google it.

Have you ever heard of topic models? Latent Dirichlet allocation (LDA) is a topic model: it views each document as a mixture of a small number of topics, with each word's presence attributable to one of the document's topics (see the Wikipedia link). So, using a topic model, you can also do this kind of grouping and assign similar texts (with a similar topic) to groups. I recommend you read about topic models, since they are commonly used for text-clustering problems like this.
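For example, a minimal K-means-over-TF-IDF baseline with scikit-learn (the sentences are placeholders); you could swap the TF-IDF matrix for averaged word2vec/GloVe vectors to bring in semantics:

```python
# Cluster sentences with K-means on TF-IDF vectors and inspect the centroids.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

sentences = ["the cat sat on the mat", "dogs are great pets",
             "a kitten on a rug", "I love my dog"]

X = TfidfVectorizer().fit_transform(sentences)
km = KMeans(n_clusters=2, random_state=0, n_init=10).fit(X)

print(km.labels_)            # cluster assignment per sentence
print(km.cluster_centers_)   # centroid of each cluster in TF-IDF space
```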
I hope my answer was helpful.
In my view, you can use LDA (latent Dirichlet allocation). It is more flexible than other clustering techniques because its alpha and beta priors can adjust the contribution of each topic in a document and of each word in a topic. That can help if the documents are not of similar length or quality.
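For illustration, a small gensim LDA sketch where the alpha and eta priors are learned from the data (the toy corpus is just a placeholder):

```python
# LDA with asymmetric priors learned during training (alpha/eta="auto").
from gensim.corpora import Dictionary
from gensim.models import LdaModel

texts = [["health", "doctor", "patient"], ["game", "online", "player"],
         ["doctor", "hospital", "patient"], ["player", "game", "level"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               alpha="auto", eta="auto", passes=10)
print(lda.print_topics())
```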
I am running LDA on health-related data. Specifically, I have ~500 documents containing interviews that run around 5-7 pages each. While I cannot really go into the details of the data or results due to confidentiality, I will describe the results and walk through the procedure to give a better idea of what I am doing and where I can improve.
For the results, I chose 20 topics and output 10 words per topic. Although 20 was somewhat arbitrary and I did not have a clear idea of a good number of topics, it seemed reasonable given the size of the data and the fact that the documents are all health-specific. However, the results highlighted two issues: 1) it is unclear what the topics are, since the words within each topic did not necessarily go together or tell a story, and 2) many words overlapped across the various topics, and a few words showed up in most topics.
In terms of what I did: I first preprocessed the text. I converted everything to lowercase, removed punctuation, and removed unnecessary codings specific to this set of documents. I then tokenized the documents, lemmatized the words, and applied tf-idf. I used sklearn's tf-idf capabilities, and within the tf-idf initialization I specified a customized list of stopwords to be removed (in addition to nltk's set of stopwords). I also set max_df to 0.9 (unclear what a good number is, I just played around with different values), min_df to 2, and max_features to 5000. I tried both tf-idf and bag of words (count vectorizer), but I found tf-idf to give slightly clearer and more distinct topics when analyzing the LDA output. After this was done, I ran an LDA model with the number of topics set to 20 and the number of iterations to 5.
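In code, the pipeline roughly looks like this (the documents and stopword list below are tiny placeholders for the real interview data):

```python
# TF-IDF with custom stopwords, then LDA with 20 topics and 5 iterations, as described above.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

documents = ["patient reports mild pain", "doctor notes patient recovery",
             "interview covers diet and exercise", "patient discusses exercise and pain"]
custom_stopwords = ["interview", "reports"]   # in practice: nltk stopwords plus domain-specific terms

vectorizer = TfidfVectorizer(stop_words=custom_stopwords, max_df=0.9,
                             min_df=2, max_features=5000)
X = vectorizer.fit_transform(documents)

lda = LatentDirichletAllocation(n_components=20, max_iter=5, random_state=0)
doc_topics = lda.fit_transform(X)

# Top 10 words per topic
terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[:-11:-1]]
    print(f"Topic {k}: {', '.join(top)}")
```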
From my understanding, each decision I made above may have contributed to the LDA model's ability to identify clear, meaningful topics. I know that text processing plays a huge role in LDA performance, and the better job I do there, the more insightful the LDA will be.
Is there anything glaringly wrong, or something I missed? Do you have any suggested values/explorations for any of the parameters I described above?
How detailed/nit-picky should I be when filtering out potential domain-specific stopwords?
How do I determine a good number of topics and iterations during the LDA step?
How can I go about validating performance, other than qualitatively comparing output?
I appreciate all insights and input. I am completely new to the area of topic modeling and while I have read some articles, I have a lot to learn! Thank you!
How do I determine a good number of topics and iterations during the LDA step?
This is the most difficult question for clustering algorithms like LDA. There is a metric, topic coherence, that can help determine which number of topics is best: https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/topic_coherence_tutorial.ipynb
In my experience, optimizing this metric by tuning the number of topics, iterations or other hyper-parameters won't necessarily give you interpretable topics.
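For example, a rough gensim sketch comparing coherence across different numbers of topics (the toy texts stand in for your real tokenized documents):

```python
# Train LDA models with different topic counts and compare their c_v coherence.
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

texts = [["patient", "doctor", "pain"], ["diet", "exercise", "health"],
         ["doctor", "hospital", "patient"], ["exercise", "pain", "recovery"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

for num_topics in (2, 3, 4):          # in practice, e.g. 10, 20, 30
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics, passes=10)
    cm = CoherenceModel(model=lda, texts=texts, dictionary=dictionary, coherence="c_v")
    print(num_topics, cm.get_coherence())
```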
How can I go about validating performance, other than qualitatively comparing output?
Again, you may use the above metric to validate performance, but I also found visualization of the topics useful: http://nbviewer.jupyter.org/github/bmabey/pyLDAvis/blob/master/notebooks/pyLDAvis_overview.ipynb
This not only gives you topic histograms but also shows how far apart the topics are, which again may help with finding an optimal number of topics.
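A minimal usage sketch, reusing the lda model, corpus and dictionary from the snippet above (in older pyLDAvis releases the module is pyLDAvis.gensim):

```python
# Interactive topic visualization: inter-topic distances plus per-topic term bars.
import pyLDAvis
import pyLDAvis.gensim_models

vis = pyLDAvis.gensim_models.prepare(lda, corpus, dictionary)
pyLDAvis.save_html(vis, "lda_topics.html")   # or pyLDAvis.display(vis) in a notebook
```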
In my studies I was not using scikit-learn, but rather gensim.
I have a corpus that contains several documents, for example 10 documents. The idea is to compute the similarity between them and combine the most similar ones into one document, so the result may be 4 documents. What I have done so far is to iterate over the documents, find the two most similar ones, and combine them into one document, and so on until a threshold is reached. I used Word2vec vectors, taking the mean vector over the whole document. The problem is that as I proceed with the iterations, the longer a document gets, the more similar it appears, even when it is not actually that similar, simply because it contains more words. Any idea how to approach this problem?
I used the pre-trained Google Word2vec model. The reason: the corpus is not big enough to train a model.
Note: I do not want to use topic modeling, for reasons specific to my requirements. Also, the documents are really short; more than half of them may be just one sentence.
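For reference, this is roughly the averaging approach I am using (with the pre-trained vectors loaded through gensim.downloader; the example sentences are just placeholders):

```python
# Represent each document as the mean of its word vectors, then compare with cosine similarity.
import numpy as np
import gensim.downloader as api

wv = api.load("word2vec-google-news-300")   # pre-trained Google News vectors

def doc_vector(tokens):
    vecs = [wv[t] for t in tokens if t in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

sim = cosine(doc_vector("the food was great".split()),
             doc_vector("really tasty meal".split()))
print(sim)
```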
I really appreciate your suggestions.
Latent Dirichlet Allocation (LDA) is a topic model for finding the latent variables (topics) underlying a bunch of documents. I'm using the Python gensim package and I'm having two problems:
I printed out the most frequent words for each topic (I tried 10, 20 and 50 topics) and found that the distribution over words is very "flat": even the most frequent word has only 1% probability...
Most of the topics are similar: the most frequent words for each topic overlap a lot, and the topics share almost the same set of high-frequency words...
I guess the problem is probably due to my documents: they all belong to a specific category; for example, they are all documents introducing different online games. In my case, will LDA still work? Since the documents themselves are quite similar, a model based on "bag of words" may not be a good approach.
Could anyone give me some suggestions? Thank you!
I've found NMF to perform better when a corpus is smaller and more focused on a particular topic. In a corpus of ~250 documents all discussing the same issue, NMF was able to pull out 7 distinct, coherent topics. This has also been reported by other researchers...
"Another advantage that is particularly useful for the appli- cation
presented in this paper is that NMF is capable of identifying niche
topics that tend to be under-reported in traditional LDA approaches" (p.6)
Greene & Cross, Exploring the Political Agenda of the European Parliament Using a Dynamic Topic Modeling Approach, PDF
Unfortunately, Gensim doesn't have an implementation of NMF, but it is available in Scikit-Learn. To work effectively, you need to feed NMF TF-IDF weighted vectors rather than the raw frequency counts you would use with LDA.
If you're used to Gensim and have preprocessed everything that way, Gensim has some utilities to convert a corpus to Scikit-compatible structures. However, I think it would actually be simpler to just use Scikit-Learn throughout. There is a good example of using NMF here.
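A rough scikit-learn sketch of this (using the 20 newsgroups data purely as a stand-in corpus; swap in your own documents):

```python
# NMF topic extraction on TF-IDF weighted vectors rather than raw counts.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

docs = fetch_20newsgroups(remove=("headers", "footers", "quotes")).data[:250]

vectorizer = TfidfVectorizer(max_df=0.9, min_df=2, stop_words="english")
X = vectorizer.fit_transform(docs)

nmf = NMF(n_components=7, init="nndsvd", random_state=0)
W = nmf.fit_transform(X)   # document-topic weights

terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(nmf.components_):
    top = [terms[i] for i in weights.argsort()[:-11:-1]]
    print(f"Topic {k}: {', '.join(top)}")
```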