How to cluster documents under topics using latent semantic analysis (lsa) - python

I've been working on latent semantic analysis (lsa) and applied this example: https://radimrehurek.com/gensim/tut2.html
It includes the terms clustering under topics but couldn't find anything how we can cluster documents under topics.
In that example, it says that 'It appears that according to LSI, “trees”, “graph” and “minors” are all related words (and contribute the most to the direction of the first topic), while the second topic practically concerns itself with all the other words. As expected, the first five documents are more strongly related to the second topic while the remaining four documents to the first topic'.
How can we relate those five documents with Python code to the related topic?
You can find my python code below. I would appreciate any help.
from numpy import asarray
from gensim import corpora, models, similarities
#https://radimrehurek.com/gensim/tut2.html
documents = ["Human machine interface for lab abc computer applications",
"A survey of user opinion of computer system response time",
"The EPS user interface management system",
"System and human system engineering testing of EPS",
"Relation of user perceived response time to error measurement",
"The generation of random binary unordered trees",
"The intersection graph of paths in trees",
"Graph minors IV Widths of trees and well quasi ordering",
"Graph minors A survey"]
# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
for document in documents]
# remove words that appear only once
all_tokens = sum(texts, [])
tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) == 1)
texts = [[word for word in text if word not in tokens_once] for text in texts]
dictionary = corpora.Dictionary(texts)
corp = [dictionary.doc2bow(text) for text in texts]
tfidf = models.TfidfModel(corp) # step 1 -- initialize a model
corpus_tfidf = tfidf[corp]
# extract 400 LSI topics; use the default one-pass algorithm
lsi = models.lsimodel.LsiModel(corpus=corp, id2word=dictionary, num_topics=2)
corpus_lsi = lsi[corpus_tfidf]
#for i in range(0, lsi.num_topics-1):
for i in range(0, 3):
print lsi.print_topics(i)
for doc in corpus_lsi: # both bow->tfidf and tfidf->lsi transformations are actually executed here, on the fly
print(doc)

corpus_lsi has a list of 9 vectors, which is the number of documents.
Each vector stores in at its i-th index the likeliness that this document belongs to topic i.
If you just want to assign a document to 1 topic, choose the topic-index with the highest value in your vector.

Related

Meaning behind converting LDA topics to "suitable" TF-IDF matrices

As a beginner in text mining, I am trying to replicate the analyses from this paper. Essentially, the authors extract LDA topics (1-4) from a document and then "the topics extracted by LDA have been converted to suitable TF-IDF matrices that have been then used to predict..." (not important what they predict, it's a bunch of regressions). Their definition of TF and IDF (section 4.2.5) seems common, though, my understanding is that the TF-IDF measures apply to a word in a document, not topics. Given that they have a case where they extract a single topic, I think it's impossible to use the probability of the topic in a document, as this will always be 1 (though correct me if I am wrong).
So, what are the possible interpretations of converting LDA topics to "suitable TF-IDF" matrices? (and how would one go about doing that using the below code?)
Would that mean converting each and every word in a document to its TF-IDF weight and then use in prediction? That does not seem plausible as with 1000+ documents, that'd be pretty high and almost certainly most of them would be useless.
Minimally reproducible example
(credit: Jordan Barber)
from nltk.tokenize import RegexpTokenizer
from stop_words import get_stop_words
from nltk.stem.porter import PorterStemmer
from gensim import corpora, models
import gensim
tokenizer = RegexpTokenizer(r'\w+')
# create English stop words list
en_stop = get_stop_words('en')
# Create p_stemmer of class PorterStemmer
p_stemmer = PorterStemmer()
# create sample documents
doc_a = "Brocolli is good to eat. My brother likes to eat good brocolli, but not my mother."
doc_b = "My mother spends a lot of time driving my brother around to baseball practice."
doc_c = "Some health experts suggest that driving may cause increased tension and blood pressure."
doc_d = "I often feel pressure to perform well at school, but my mother never seems to drive my brother to do better."
doc_e = "Health professionals say that brocolli is good for your health."
# compile sample documents into a list
doc_set = [doc_a, doc_b, doc_c, doc_d, doc_e]
# list for tokenized documents in loop
texts = []
# loop through document list
for i in doc_set:
# clean and tokenize document string
raw = i.lower()
tokens = tokenizer.tokenize(raw)
# remove stop words from tokens
stopped_tokens = [i for i in tokens if not i in en_stop]
# stem tokens
stemmed_tokens = [p_stemmer.stem(i) for i in stopped_tokens]
# add tokens to list
texts.append(stemmed_tokens)
# turn our tokenized documents into a id <-> term dictionary
dictionary = corpora.Dictionary(texts)
# convert tokenized documents into a document-term matrix
corpus = [dictionary.doc2bow(text) for text in texts]
# generate LDA model
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=2, id2word = dictionary, passes=20)

Understanding LDA / topic modelling -- too much topic overlap

I'm new to topic modelling / Latent Dirichlet Allocation and have trouble understanding how I can apply the concept to my dataset (or whether it's the correct approach).
I have a small number of literary texts (novels) and would like to extract some general topics using LDA.
I'm using the gensim module in Python along with some nltk features. For a test I've split up my original texts (just 6) into 30 chunks with 1000 words each. Then I converted the chunks into document-term matrices and ran the algorithm. This is the code (although I think it doesn't matter for the question) :
# chunks is a 30x1000 words matrix
dictionary = gensim.corpora.dictionary.Dictionary(chunks)
corpus = [ dictionary.doc2bow(chunk) for chunk in chunks ]
lda = gensim.models.ldamodel.LdaModel(corpus = corpus, id2word = dictionary,
num_topics = 10)
topics = lda.show_topics(5, 5)
However the result is completely different from any example I've seen in that the topics are full of meaningless words that can be found in all source documents, e.g. "I", "he", "said", "like", ... example:
[(2, '0.009*"I" + 0.007*"\'s" + 0.007*"The" + 0.005*"would" + 0.004*"He"'),
(8, '0.012*"I" + 0.010*"He" + 0.008*"\'s" + 0.006*"n\'t" + 0.005*"The"'),
(9, '0.022*"I" + 0.014*"\'s" + 0.009*"``" + 0.007*"\'\'" + 0.007*"like"'),
(7, '0.010*"\'s" + 0.009*"I" + 0.006*"He" + 0.005*"The" + 0.005*"said"'),
(1, '0.009*"I" + 0.009*"\'s" + 0.007*"n\'t" + 0.007*"The" + 0.006*"He"')]
I don't quite understand why that happens, or why it doesn't happen with the examples I've seen. How do I get the LDA model to find more distinctive topics with less overlap? Is it a matter of filtering out more common words first? How can I adjust how many times the model runs? Is the number of original texts too small?
LDA is extremely dependent on the words used in a corpus and how frequently they show up. The words you are seeing are all stopwords - meaningless words that are the most frequent words in a language e.g. "the", "I", "a", "if", "for", "said" etc. and since these words are the most frequent, it will negatively impact the model.
I would use the nltk stopword corpus to filter out these words:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
Then make sure your text does not contain any of the words in the stop_words list (by whatever pre processing method you are using) - an example is below
text = text.split() # split words by space and convert to list
text = [word for word in text if word not in stop_words]
text = ' '.join(text) # join the words in the text to make it a continuous string again
You may also want to remove punctuation and other characters ("/","-") etc.) then use regular expressions:
import re
remove_punctuation_regex = re.compile(r"[^A-Za-z ]") # regex for all characters that are NOT A-Z, a-z and space " "
text = re.sub(remove_punctuation_regex, "", text) # sub all non alphabetical characters with empty string ""
Finally, you may also want to filter on most frequent or least frequent words in your corpus, which you can do using nltk:
from nltk import FreqDist
all_words = text.split() # list of all the words in your corpus
fdist = FreqDist(all_words) # a frequency distribution of words (word count over the corpus)
k = 10000 # say you want to see the top 10,000 words
top_k_words, _ = zip(*fdist.most_common(k)) # unzip the words and word count tuples
print(top_k_words) # print the words and inspect them to see which ones you want to keep and which ones you want to disregard
That should get rid of the stopwords and extra characters, but still leaves the vast problem of topic modelling (which I wont try to explain here but will leave some tips and links).
Assuming you know a little bit about topic modelling, lets start. LDA is a bag of words model, meaning word order doesnt matter. The model assigns a topic distribution (of a predetermined number of topics K) to each document, and a word distribution to each topic. A very insightful high level video explains this here. If you want to see more of the mathematics, but still at an accessible level, check out this video. The more documents the better, and usually longer documents (with more words) also fair better using LDA - this paper shows that LDA doesnt perform well with short texts (less than ~20 words). K is up to you to choose, and really depends on your corpus of documents (how large it is, what different topics it covers etc.). Usually a good value of K is between 100-300, but again this really depends on your corpus.
LDA has two hyperparamters, alpha and beta (alpha and eta in gemsim) - a higher alpha means each text will be represented by more topics (so naturally a lower alpha means each text will be represented by less topics). A high eta means each topic is represented by more words, and a low eta means each topic is represented by less words - so with a low eta you would get less "overlap" between topics.
There's many insights you could gain using LDA
What are the topics in a corpus (naming topics may not matter to your application, but if it does this can be done by inspecting the words in a topic as you have done above)
What words contribute most to a topic
What documents in the corpus are most similar (using a similarity metric)
Hope this has helped. I was new to LDA a few months ago but I've quickly gotten up to speed using stackoverflow and youtube!

Latent Dirichlet allocation(LDA) performance by limiting word size for Corpus Documents

I have been generating topics with yelp data set of customer reviews by using Latent Dirichlet allocation(LDA) in python(gensim package). While generating tokens, I am selecting only the words having length >= 3 from the reviews( By using RegexpTokenizer):
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w{3,}')
tokens = tokenizer.tokenize(review)
This will allow us to filter out the noisy words of length less than 3, while creating the corpus document.
How will filtering out these words effect performance with the LDA algorithm?
Generally speaking, for the English language, one and two letter words don't add information about the topic. If they don't add value they should be removed during the pre-processing step. Like most algorithms, less data in will speed up the execution time.
Words less than length 3 are considered stop words. LDAs build topics so imagine you generate this topic:
[I, him, her, they, we, and, or, to]
compared to:
[shark, bull, greatwhite, hammerhead, whaleshark]
Which is more telling? This is why it is important to remove stopwords. This is how I do that:
# Create functions to lemmatize stem, and preprocess
# turn beautiful, beautifuly, beautified into stem beauti
def lemmatize_stemming(text):
stemmer = PorterStemmer()
return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))
# parse docs into individual words ignoring words that are less than 3 letters long
# and stopwords: him, her, them, for, there, ect since "their" is not a topic.
# then append the tolkens into a list
def preprocess(text):
result = []
for token in gensim.utils.simple_preprocess(text):
newStopWords = ['your_stopword1', 'your_stopword2']
if token not in gensim.parsing.preprocessing.STOPWORDS and token not in newStopWords and len(token) > 3:
nltk.bigrams(token)
result.append(lemmatize_stemming(token))
return result

LSI Clustering using gensim in python

I'm trying to run this example code in Python 2.7 for LSI text clustering.
import gensim
from gensim import corpora, models, similarities
documents = ["Human machine interface for lab abc computer applications",
"A survey of user opinion of computer system response time",
"The EPS user interface management system",
"System and human system engineering testing of EPS",
"Relation of user perceived response time to error measurement",
"The generation of random binary unordered trees",
"The intersection graph of paths in trees",
"Graph minors IV Widths of trees and well quasi ordering",
"Graph minors A survey"]
# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
for document in documents]
# remove words that appear only once
all_tokens = sum(texts, [])
tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) == 1)
texts = [[word for word in text if word not in tokens_once] for text in texts]
dictionary = corpora.Dictionary(texts)
corp = [dictionary.doc2bow(text) for text in texts]
# extract 400 LSI topics; use the default one-pass algorithm
lsi = gensim.models.lsimodel.LsiModel(corpus=corp, id2word=dictionary, num_topics=400)
# print the most contributing words (both positively and negatively) for each of the first ten topics
lsi.print_topics(10)
But it returns this error for the 2nd last command.
Warning (from warnings module):
File "/usr/local/lib/python2.7/dist-packages/numpy/core/fromnumeric.py", line 2499
VisibleDeprecationWarning)
VisibleDeprecationWarning: `rank` is deprecated; use the `ndim` attribute or function instead. To find the rank of a matrix see `numpy.linalg.matrix_rank`.
Please let me know what I am doing wrong or if i need to update anything to make it work.

How to print out the full distribution of words in an LDA topic in gensim?

The lda.show_topics module from the following code only prints the distribution of the top 10 words for each topic, how do i print out the full distribution of all the words in the corpus?
from gensim import corpora, models
documents = ["Human machine interface for lab abc computer applications",
"A survey of user opinion of computer system response time",
"The EPS user interface management system",
"System and human system engineering testing of EPS",
"Relation of user perceived response time to error measurement",
"The generation of random binary unordered trees",
"The intersection graph of paths in trees",
"Graph minors IV Widths of trees and well quasi ordering",
"Graph minors A survey"]
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
for document in documents]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
lda = models.ldamodel.LdaModel(corpus_tfidf, id2word=dictionary, num_topics=2)
for i in lda.show_topics():
print i
There is a variable call topn in show_topics() where you can specify the number of top N words you require from the words distribution over each topic. see http://radimrehurek.com/gensim/models/ldamodel.html
So instead of the default lda.show_topics(). You can use the len(dictionary) for the full word distributions for each topic:
for i in lda.show_topics(topn=len(dictionary)):
print i
There are two variable call num_topics and num_words in show_topics(),for num_topics number of topics, return num_words most significant words (10 words per topic, by default). see http://radimrehurek.com/gensim/models/ldamodel.html#gensim.models.ldamodel.LdaModel.show_topics
So you can use the len(lda.id2word) for the full words distributions for each topic,and the lda.num_topics for the all topics in your lda model.
for i in lda.show_topics(formatted=False,num_topics=lda.num_topics,num_words=len(lda.id2word)):
print i
The below code will print your words as well as their probability. I have printed top 10 words. You can change num_words = 10 to print more words per topic.
for words in lda.show_topics(formatted=False,num_words=10):
print(words[0])
print("******************************")
for word_prob in words[1]:
print("(",dictionary[int(word_prob[0])],",",word_prob[1],")",end = "")
print("")
print("******************************")

Categories