I am working on a language other than English (Urdu) and I have scraped data from different sources. I have already done my preprocessing: punctuation removal, stop-word removal and tokenization. Now I want to extract domain-specific lexicons. Say I have data related to sports, entertainment, etc., and I want to extract the words related to each of these fields (like cricket) and place them in closely related topics. I tried LDA for this, but I am not getting correct clusters. Also, a word that belongs to one topic often appears in other topics as well.
How can I improve my results?
# URDU STOP WORDS REMOVAL
from gensim import corpora, models

# UrduCorpusReader, wordlists and remove_urdu_stopwords come from my own code;
# wordlists is a corpus reader over the scraped document files
doc_clean = []
stopwords_corpus = UrduCorpusReader('./data', ['stopwords-ur.txt'])
stopwords = stopwords_corpus.words()

for infile in wordlists.fileids():
    words = wordlists.words(infile)
    finalized_words = remove_urdu_stopwords(stopwords, words)
    doc_clean.append(finalized_words)
    print("\n==== WITHOUT STOPWORDS ===========\n")
    print(finalized_words)

# make the dictionary and the corpus (document-term matrix)
dictionary = corpora.Dictionary(doc_clean)
matrx = [dictionary.doc2bow(text) for text in doc_clean]

# generate the LDA model
lda = models.ldamodel.LdaModel(corpus=matrx, id2word=dictionary, num_topics=5, passes=10)
for top in lda.print_topics():
    print("\n===topics from files===\n")
    print(top)
LDA and its drawbacks: The idea of LDA is to uncover latent topics in your corpus. A drawback of this unsupervised machine-learning approach is that you may end up with topics that are hard for humans to interpret. Another drawback is that you will most likely end up with some generic topics containing words that appear in every document (like 'introduction', 'date', 'author', etc.). Thirdly, you will not be able to uncover latent topics that are simply not present enough: if you have only one article about cricket, it will not be recognised by the algorithm.
Why LDA doesn't fit your case:
You are searching for explicit topics like cricket, and you want to learn something about cricket vocabulary, correct? However, LDA will only output some topics, and you need to recognise cricket vocabulary in order to determine that, e.g., topic 5 is concerned with cricket. Often, LDA will identify topics that are mixed with other, related topics. Keeping this in mind, there are three scenarios:
You don't know anything about cricket, but you are able to identify the topic that's concerned with cricket.
You are a cricket expert and already know the cricket vocabulary.
You don't know anything about cricket and are not able to identify the semantic topic that the LDA produced.
In the first case, you will have the problem that you are likely to associate words with cricket that are actually not related to it, because you are counting on the LDA output to provide high-quality topics concerned only with cricket and not with related topics or generic terms. In the second case, you don't need the analysis in the first place, because you already know the cricket vocabulary! The third case is likely when you are relying on your computer to interpret the topics. However, in LDA you always rely on humans to give a semantic interpretation of the output.
So what to do: There's a paper called Targeted Topic Modeling for Focused Analysis (Wang 2016), which tries to identify which documents are concerned with a pre-defined topic (like cricket). If you have a list of topics for which you'd like to get some topic-specific vocabulary (cricket, basketball, romantic comedies, ...), a starting point could be to first identify the relevant documents and then analyse the word distributions of those documents (a rough sketch of this two-step idea follows below).
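A minimal sketch of that two-step idea, independent of the paper's exact method: first select the documents that contain a few hand-picked seed words for the target domain, then rank the words by how over-represented they are in that subset compared to the whole corpus. The seed_words list, the doc_clean variable (from your code above) and the frequency-ratio scoring are illustrative assumptions, not the paper's algorithm:

from collections import Counter

def domain_lexicon(docs, seed_words, top_n=30, min_count=3):
    """docs: list of token lists; seed_words: hand-picked words for the target domain."""
    seed_words = set(seed_words)
    in_domain = [d for d in docs if seed_words & set(d)]      # documents about the domain
    domain_counts = Counter(w for d in in_domain for w in d)
    corpus_counts = Counter(w for d in docs for w in d)
    n_domain = sum(domain_counts.values()) or 1
    n_corpus = sum(corpus_counts.values()) or 1
    # score = relative frequency inside the domain subset vs. in the whole corpus
    scores = {w: (domain_counts[w] / n_domain) / (corpus_counts[w] / n_corpus)
              for w in domain_counts if corpus_counts[w] >= min_count}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# e.g. cricket_words = domain_lexicon(doc_clean, ['cricket', 'wicket', 'batsman'])
# (with Urdu data the seed words would of course be Urdu terms)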
Note that there may be completely different methods that do exactly what you're looking for. If you want to stay in the LDA-related literature, I'm relatively confident that the article I linked is your best shot.
Edit:
If this answer is useful to you, you may find my paper interesting, too. It takes a labeled dataset of academic economics papers (600+ possible labels) and tries various LDA flavours to get the best predictions on new academic papers. The repo contains my code, documentation, and the paper itself.
Related
I've been experimenting with NLP and am using the Doc2Vec model.
My goal is a "suggested questions" feature for a forum: if a user types a question, its vector is compared to the vectors of questions that have already been asked. So far this has worked OK for comparing one question to another asked question.
However, I would like to extend this to comparing the body of the question. For example, just like on Stack Overflow, I'm writing a description for my question.
I understand that Doc2Vec represents sentences through paragraph IDs. So for the question example I spoke about first, each sentence gets a unique paragraph ID. However, with paragraphs, i.e. the body of the question, sentences will share an ID with the other sentences that are part of the same paragraph.
para = 'This is a sentence. This is another sentence'
# both sentences share the same tag because they are part of the same paragraph
[TaggedDocument(words=['This', 'is', 'a', 'sentence'], tags=[1]),
 TaggedDocument(words=['This', 'is', 'another', 'sentence'], tags=[1])]
I'm wondering how to go about doing this. How can I input a corpus like this:
['It is a nice day today. I wish I was outside in the sun. But I need to work.']
and compare that to another paragraph like this:
['It is a lovely day today. The sun is shining outside today. However, I am working.']
I would expect a very high similarity between the two. Is similarity calculated sentence to sentence rather than paragraph to paragraph? I.e.
cosine_sim(['It is a nice day today'], ['It is a lovely day today'])
and do this for the other sentences and average out the similarity scores?
Thanks.
EDIT
What I am confused about is this: using the above sentences, say the vectors are like so:
sent1 = [0.23,0.1,0.33...n]
sent2 = [0.78,0.2,-0.6...n]
sent3 = [0.55,-0.5,0.9...n]
# Average out these vectors
para = [0.5,0.2,0.3...n]
and then compare this vector to another paragraph's vector computed by the same process.
I'll presume you're talking about the Doc2Vec model in the Python Gensim library, based on the word2vec-like 'Paragraph Vector' algorithm. (There are many alternative ways to turn a text into a vector, and some of them, including the very simple approach of averaging word-vectors together, also get called 'Doc2Vec'.)
Doc2Vec has no internal idea of sentences or paragraphs. It just considers texts: lists of word tokens. So you decide what-sized chunks of text to provide, and to associate with tag keys: multiword fragments, sentences, paragraphs, sections, chapters, articles, books, whatever.
Every tag you provide during initial bulk training will have an associated vector trained up, and stored in the model, based on the lists-of-words that you provided alongside it. So, you can retrieve those vectors from training via:
d2v_model.dv[tag]
You can also use that trained, frozen model to infer new vectors for new lists-of-words:
d2v_model.infer_vector(list_of_words)
(Note: these words should be preprocessed/tokenized the same way as those during training, and any words not known to the model from training will be silently ignored.)
And, once you have vectors for two different texts, from whatever method, you can compare them via cosine-similarity.
For creating your doc-vectors, you might want to run the question & body together into one text. (If the question is more important, you could even consider training on a pseudotext that repeats the question more than once, for example both before and after the body.) Or you might want to treat them separately, so that some downstream process can weight question->question similarities differently than body->body or question->body. What's best for your data & goals usually has to be determined via experimentation.
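Here is a minimal end-to-end sketch of that flow using Gensim's Doc2Vec API (Doc2Vec, TaggedDocument, infer_vector and the dv keyed-vectors are real API); the toy texts, the split()-based tokenization and the hyperparameter values are only illustrative assumptions, not a tuned setup:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# toy corpus: one TaggedDocument per text (e.g. question + body run together)
raw_texts = [
    "It is a nice day today. I wish I was outside in the sun. But I need to work.",
    "It is a lovely day today. The sun is shining outside today. However, I am working.",
]
train_corpus = [TaggedDocument(words=text.lower().split(), tags=[i])
                for i, text in enumerate(raw_texts)]

# small vector_size/epochs only because the toy corpus is tiny
model = Doc2Vec(train_corpus, vector_size=50, min_count=1, epochs=40)

# vectors learned in bulk training are available by tag
v0 = model.dv[0]

# infer a vector for a new, unseen text, tokenized the same way as during training
new_vec = model.infer_vector("It is a beautiful day today".lower().split())

# cosine similarity between stored tags, or against an inferred vector
print(model.dv.similarity(0, 1))
print(model.dv.most_similar([new_vec], topn=1))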
I am trying to get unique words for each topic.
I am using gensim, and this is the line that helps me generate my model:
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=2, id2word = dictionary)
But I have repeated words in the two different topics; I would like to have different words per topic.
You cannot enforce word uniqueness per topic in LDA, since each topic is a distribution over all the words in the vocabulary. This distribution measures the probability that words co-occur inside a topic. Thus, nothing prevents a word from co-occurring with different words in different contexts, which leads to the same word appearing in different topics.
Let's take an example by considering these two documents:
doc1: The python is a beautiful snake living in the forest.
doc2: Python is a beautiful language used by programmer and data scientist.
In doc1, the word python co-occurs with snake, forest and living, which might give this word a good probability of appearing in a topic about, let's say, biology.
In doc2, the word python co-occurs with language, programmer and data, which, in this case, will associate this word with a topic about computer science.
What you can do instead is look at the words that have the highest probability within each topic in order to achieve what you want; a rough sketch of how to do that with gensim follows below.
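A rough sketch, assuming the ldamodel variable from the question; show_topic is real Gensim API, while the rule "keep a word only in the topic where it is most probable" is just one illustrative way to post-process the output:

topn = 20

# collect the top words of every topic with their probabilities
topic_words = {
    topic_id: dict(ldamodel.show_topic(topic_id, topn=topn))
    for topic_id in range(ldamodel.num_topics)
}

# assign each word only to the topic where it has the highest probability
unique_topic_words = {topic_id: [] for topic_id in topic_words}
for topic_id, words in topic_words.items():
    for word in words:
        best_topic = max(topic_words, key=lambda t: topic_words[t].get(word, 0.0))
        if best_topic == topic_id:
            unique_topic_words[topic_id].append(word)

for topic_id, words in unique_topic_words.items():
    print(topic_id, words)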
Words that are grouped into one topic are not necessarily semantically similar (i.e. close together in a word2vec-style embedding space); they just co-occur more often.
I am using gensim to do LDA on a corpus of arXiv abstracts in the category stats.ML
My problem is that there is a lot of overlap between the topics (whether I pick 5, 10, or 50 topics). Every topic has a distribution of words like "model", "algorithm", or "problem". How can topics be considered distinguishable if so many of them prominently feature the same terms?
Using pyLDAvis was instructive for me. This is the word distribution for topic #3 [pyLDAvis screenshot omitted]. But when I turn lambda down to 0.08, the actual nature of the topic emerges (ML in medical applications) [second screenshot omitted].
So my question is, how could I uncover these distinctive terms in the course of training my LDA model (without pyLDAvis)? And also, does the performance (as opposed to interpretability) of the model improve if I can get it to ignore these common, non-discriminating terms?
I have several ideas to try but would like more guidance:
Filtering the 50 most common terms from my dictionary. While I think it helped a bit, I'm not sure if it's the right approach.
Tweaking the eta parameter in gensim.models.LdaModel (a rough sketch of both ideas follows below).
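A minimal sketch of both ideas with Gensim's actual API (Dictionary.filter_n_most_frequent, Dictionary.filter_extremes and the eta argument of LdaModel all exist); the texts variable and the cutoff values are illustrative placeholders, not recommendations:

from gensim import corpora, models

# 'texts' is assumed to be the tokenized abstracts (a list of token lists)
dictionary = corpora.Dictionary(texts)

# idea 1: drop the most common terms (and very rare ones) before building the corpus
dictionary.filter_n_most_frequent(50)                  # remove the 50 most frequent tokens
dictionary.filter_extremes(no_below=5, no_above=0.5)   # and tokens in <5 docs or >50% of docs
corpus = [dictionary.doc2bow(text) for text in texts]

# idea 2: a small eta makes the topic-word distributions sparser, so each topic
# concentrates its probability mass on fewer, more specific words
lda = models.LdaModel(corpus=corpus, id2word=dictionary,
                      num_topics=10, passes=10,
                      alpha='auto', eta=0.01)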
My goal is ultimately to take a new document and do word coloring on it based on which words relate to which topics, and also offer the documents most similar to the input document.
I am pretty new to gensim, and this is my first SO question, so if I'm totally off-base with something, please let me know ;-). Thank you!
I am running LDA on health-related data. Specifically I have ~500 documents that contain interviews that last around 5-7 pages. While I cannot really go into the details of the data or results due to preserving data integrity/confidentiality, I will describe the results and go through the procedure to give a better idea of what I am doing and where I can improve.
For the results, I chose 20 topics and output 10 words per topic. Although 20 was somewhat arbitrary and I did not have a clear idea of a good number of topics, it seemed reasonable given the size of the data and the fact that the documents are all health-specific. However, the results highlighted two issues: 1) it is unclear what the topics are, since the words within each topic did not necessarily go together or tell a story, and 2) many words overlapped across the various topics, and a few words showed up in most topics.
In terms of what I did, I first preprocessed the text: I converted everything to lowercase, removed punctuation, and removed unnecessary codings specific to the set of documents at hand. I then tokenized the documents, lemmatized the words, and applied tf-idf. I used sklearn's tf-idf vectorizer, and when initializing it I passed a customized list of stopwords to be removed (nltk's stopword set plus my own additions). I also set max_df to 0.9 (unclear what a good number is, I just played around with different values), min_df to 2, and max_features to 5000. I tried both tf-idf and bag of words (count vectorizer), but I found tf-idf to give slightly clearer and more distinct topics when analyzing the LDA output. After this was done, I ran an LDA model with 20 topics and 5 iterations (a condensed sketch of the pipeline is below).
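For concreteness, here is a condensed sketch of that pipeline with scikit-learn's TfidfVectorizer and LatentDirichletAllocation; the documents variable and the extra stop word are placeholders, and the parameter values simply mirror the ones mentioned above:

from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# 'documents' is assumed to be the list of preprocessed interview transcripts (strings)
custom_stopwords = stopwords.words('english') + ['placeholder_domain_word']

vectorizer = TfidfVectorizer(stop_words=custom_stopwords,
                             max_df=0.9, min_df=2, max_features=5000)
dtm = vectorizer.fit_transform(documents)

lda = LatentDirichletAllocation(n_components=20, max_iter=5, random_state=0)
doc_topics = lda.fit_transform(dtm)

# inspect the 10 most heavily weighted words of each topic
terms = vectorizer.get_feature_names_out()
for topic_idx, weights in enumerate(lda.components_):
    print(topic_idx, [terms[i] for i in weights.argsort()[::-1][:10]])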
From my understanding, each decision I made above may have contributed to the LDA model's ability to identify clear, meaningful topics. I know that text processing plays a huge role in LDA performance, and the better job I do there, the more insightful the LDA will be.
Is there anything glaringly wrong, or anything I missed? Do you have any suggested values/explorations for any of the parameters I described above?
How detailed, nit-picky should I be when filtering out potential domain-specific stopwords?
How do I determine a good number of topics and iterations during the LDA step?
How can I go about validating performance, other than qualitatively comparing output?
I appreciate all insights and input. I am completely new to the area of topic modeling and while I have read some articles, I have a lot to learn! Thank you!
How do I determine a good number of topics and iterations during the LDA step?
This is the most difficult question for clustering algorithms like LDA. There is a coherence metric that can help determine which number of topics is best: https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/topic_coherence_tutorial.ipynb
In my experience, optimizing this metric by tuning the number of topics, iterations or other hyper-parameters won't necessarily give you interpretable topics (a short example of computing the metric is below).
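A short example of computing that coherence score with Gensim's CoherenceModel (real API), assuming the tokenized texts, the dictionary and a trained ldamodel are already available:

from gensim.models import CoherenceModel

cm = CoherenceModel(model=ldamodel, texts=texts,
                    dictionary=dictionary, coherence='c_v')
print('c_v coherence:', cm.get_coherence())

# typical usage: train models for several num_topics values (e.g. 5, 10, 20, 30)
# and keep the setting with the highest coherence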
How can I go about validating performance, other than qualitatively comparing output?
Again, you may use the above metric to validate performance, but I also found it useful to visualize the topics: http://nbviewer.jupyter.org/github/bmabey/pyLDAvis/blob/master/notebooks/pyLDAvis_overview.ipynb
This not only gives you per-topic word histograms but also shows how far apart the topics are, which again may help with finding the optimal number of topics.
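A minimal sketch of that visualization for a Gensim model; pyLDAvis is a real package, but note that depending on its version the module is pyLDAvis.gensim_models (newer) or pyLDAvis.gensim (older):

import pyLDAvis
import pyLDAvis.gensim_models as gensimvis   # use pyLDAvis.gensim on older versions

vis = gensimvis.prepare(ldamodel, corpus, dictionary)
pyLDAvis.save_html(vis, 'lda_topics.html')   # or pyLDAvis.display(vis) inside a notebook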
In my studies I was not using scikit, but rather gensim.
Latent Dirichlet Allocation (LDA) is a topic model for finding the latent variables (topics) underlying a bunch of documents. I'm using the Python gensim package and am having two problems:
I printed out the most frequent words for each topic (I tried 10, 20, and 50 topics) and found that the distribution over words is very "flat": even the most frequent word has only 1% probability...
Most of the topics are similar: the most frequent words of the topics overlap a lot, and the topics share almost the same set of high-frequency words...
I guess the problem is probably due to my documents: they all belong to one specific category, for example, they are all documents introducing different online games. In my case, will LDA still work? Since the documents themselves are quite similar, a bag-of-words model may not be a good approach to try.
Could anyone give me some suggestions? Thank you!
I've found NMF to perform better when a corpus is smaller and more focused around a particular topic. In a corpus of ~250 documents all discussing the same issue, NMF was able to pull out 7 distinct, coherent topics. This has also been reported by other researchers...
"Another advantage that is particularly useful for the appli- cation
presented in this paper is that NMF is capable of identifying niche
topics that tend to be under-reported in traditional LDA approaches" (p.6)
Greene & Cross, Exploring the Political Agenda of the European Parliament Using a Dynamic Topic Modeling Approach, PDF
Unfortunately, Gensim doesn't have an implementation of NMF, but there is one in Scikit-Learn. To work effectively, you need to feed NMF TF-IDF-weighted word vectors rather than the raw frequency counts you use with LDA.
If you're used to Gensim and have preprocessed everything that way, Gensim has some utilities to convert a corpus to Scikit-compatible structures. However, I think it would actually be simpler to just use Scikit-Learn throughout. There is a good example of using NMF here.
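As a small sketch of that Scikit-Learn route (TfidfVectorizer and NMF are real classes; the documents variable and the parameter values are placeholders):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

# 'documents' is assumed to be a list of preprocessed strings
tfidf = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')
X = tfidf.fit_transform(documents)

nmf = NMF(n_components=7, random_state=0)
W = nmf.fit_transform(X)            # document-topic weights
terms = tfidf.get_feature_names_out()

for topic_idx, weights in enumerate(nmf.components_):
    print(topic_idx, [terms[i] for i in weights.argsort()[::-1][:10]])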