Meaning behind converting LDA topics to "suitable" TF-IDF matrices - python

As a beginner in text mining, I am trying to replicate the analyses from this paper. Essentially, the authors extract LDA topics (1-4) from a document and then "the topics extracted by LDA have been converted to suitable TF-IDF matrices that have been then used to predict..." (what they predict is not important here; it is a set of regressions). Their definition of TF and IDF (section 4.2.5) seems standard; however, my understanding is that TF-IDF applies to a word in a document, not to a topic. Given that they also have a case where they extract a single topic, I think it is impossible to use the probability of the topic in a document, as this would always be 1 (correct me if I am wrong).
So, what are the possible interpretations of converting LDA topics to "suitable TF-IDF" matrices? (And how would one go about doing that with the code below?)
Would that mean converting each and every word in a document to its TF-IDF weight and then using those weights in the prediction? That does not seem plausible: with 1000+ documents the number of features would be very large, and almost certainly most of them would be useless.
Minimal reproducible example
(credit: Jordan Barber)
from nltk.tokenize import RegexpTokenizer
from stop_words import get_stop_words
from nltk.stem.porter import PorterStemmer
from gensim import corpora, models
import gensim
tokenizer = RegexpTokenizer(r'\w+')
# create English stop words list
en_stop = get_stop_words('en')
# Create p_stemmer of class PorterStemmer
p_stemmer = PorterStemmer()
# create sample documents
doc_a = "Brocolli is good to eat. My brother likes to eat good brocolli, but not my mother."
doc_b = "My mother spends a lot of time driving my brother around to baseball practice."
doc_c = "Some health experts suggest that driving may cause increased tension and blood pressure."
doc_d = "I often feel pressure to perform well at school, but my mother never seems to drive my brother to do better."
doc_e = "Health professionals say that brocolli is good for your health."
# compile sample documents into a list
doc_set = [doc_a, doc_b, doc_c, doc_d, doc_e]
# list for tokenized documents in loop
texts = []
# loop through document list
for i in doc_set:
    # clean and tokenize document string
    raw = i.lower()
    tokens = tokenizer.tokenize(raw)
    # remove stop words from tokens
    stopped_tokens = [tok for tok in tokens if tok not in en_stop]
    # stem tokens
    stemmed_tokens = [p_stemmer.stem(tok) for tok in stopped_tokens]
    # add tokens to list
    texts.append(stemmed_tokens)
# turn our tokenized documents into an id <-> term dictionary
dictionary = corpora.Dictionary(texts)
# convert tokenized documents into a document-term matrix
corpus = [dictionary.doc2bow(text) for text in texts]
# generate LDA model
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=2, id2word = dictionary, passes=20)
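To make the question concrete, here is the only interpretation I could come up with (purely my assumption, not necessarily what the paper does): take each topic's top-N words as a vocabulary and build a documents x words TF-IDF matrix restricted to that vocabulary, so that even a single extracted topic still yields usable features. A sketch using the objects from the code above:

# assumption: "converting a topic to a TF-IDF matrix" = the TF-IDF weights of the
# topic's top-N words across all documents (one matrix per topic)
tfidf = models.TfidfModel(corpus)
corpus_tfidf = [dict(doc) for doc in tfidf[corpus]]   # per-document {word_id: tfidf weight}

top_n = 10
topic_matrices = []
for topic_id in range(ldamodel.num_topics):
    top_word_ids = sorted(wid for wid, _ in ldamodel.get_topic_terms(topic_id, topn=top_n))
    # rows = documents, columns = the topic's top words
    matrix = [[doc.get(wid, 0.0) for wid in top_word_ids] for doc in corpus_tfidf]
    topic_matrices.append(matrix)

Is something along these lines what is meant, or is there another standard reading?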

Related

LDA Topic Modelling : Topics predicted from huge corpus make no sense

I am using LDA for a topic modelling task. As suggested in various forums online, I trained my model on a fairly large corpus: the NYTimes news dataset (~200 MB CSV file), which has reports on a wide variety of news topics.
Surprisingly, the topics predicted from it are mostly related to US politics, and when I test the model on a new document about 'how to educate children and parenting stuff', it predicts the most likely topic as this:
['two', 'may', 'make', 'company', 'house', 'things', 'case', 'use']
Kindly have a look at my model:
import os

import gensim
from gensim import corpora
from gensim.models import CoherenceModel, Phrases
from gensim.models.phrases import Phraser
from gensim.utils import simple_preprocess
from nltk.corpus import stopwords

def ldamodel_english(filepath, data):
    data_words = simple_preprocess(str(data), deacc=True)

    # Building the bigram model and removing stopwords
    bigram = Phrases(data_words, min_count=5, threshold=100)
    bigram_mod = Phraser(bigram)
    stop_words_english = stopwords.words('english')
    data_nostops = [[word for word in simple_preprocess(str(doc)) if word not in stop_words_english]
                    for doc in data_words]
    data_bigrams = [bigram_mod[doc] for doc in data_nostops]
    data_bigrams = [x for x in data_bigrams if x != []]

    # Mapping indices to words for computation purposes
    id2word = corpora.Dictionary(data_bigrams)
    corpus = [id2word.doc2bow(text) for text in data_bigrams]

    # Building the LDA model. The priors 'alpha' and 'eta' control the document-topic
    # and topic-word distributions respectively
    lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=id2word, num_topics=20,
                                                random_state=10, iterations=100, update_every=1,
                                                chunksize=1000, passes=8, alpha=0.09,
                                                per_word_topics=True, eta=0.8)

    print('\nPerplexity Score: ' + str(lda_model.log_perplexity(corpus)) + '\n')
    for i, topic in lda_model.show_topics(formatted=True, num_topics=20, num_words=10):
        print('TOPIC #' + str(i) + ': ' + topic + '\n')

    coherence_model_lda = CoherenceModel(model=lda_model, texts=data_bigrams, dictionary=id2word, coherence='c_v')
    print('\nCoherence Score: ', coherence_model_lda.get_coherence())

    saved_model_path = os.path.join(filepath, 'ldamodel_english')
    lda_model.save(saved_model_path)
    return saved_model_path, corpus, id2word
The 'data' part comes from the 'Content' section of the NYTimes news dataset, and I used the gensim library for LDA.
My question is: if a well-trained LDA model predicts this badly, why is there such hype around it, and what is an effective alternative method?
This can be a perfectly valid output of the model. Given source texts that are not necessarily related to "children education and parenting", the topic found to be the most similar might only be superficially similar to your article. It is likely that there is not much vocabulary in common between the NYTimes articles and your article, so the words that make a topic distinctive among the topics typical of the NYTimes may have very little in common with your article. In fact, the only words they share may be rather generic ones, as in your case.
This happens frequently when the corpus used to train the LDA model has little to do with the documents it is later applied to, so there is really not much of a surprise here. The size of the corpus does not help, because what matters is the vocabulary/topical overlap.
I suggest that you either change the number of topics and the corpus, or find a suitable corpus to train LDA on (one that contains texts related to the documents you intend to classify).
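As a quick sanity check of that overlap, a minimal sketch (assuming id2word is the trained model's dictionary, as returned by ldamodel_english above, and new_doc_tokens is the preprocessed new document):

def vocabulary_overlap(id2word, new_doc_tokens):
    # fraction of the new document's distinct tokens that the LDA dictionary has seen
    known = set(id2word.token2id)
    doc_vocab = set(new_doc_tokens)
    return len(doc_vocab & known) / max(len(doc_vocab), 1)

# e.g. vocabulary_overlap(id2word, simple_preprocess(new_article))
# a value close to 0 suggests the training corpus is a poor fit for this document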

Matching set of words with set of sentences in python nlp

I have a use case where I want to match a list of words against a list of sentences and return the most relevant sentences.
I am working in Python. What I have already tried is KMeans, where we cluster the set of documents and then predict which cluster a new sentence falls into. But in my case the list of words is already available.
def getMostRelevantSentences():
    Sentences = ["This is the most beautiful place in the world.",
                 "This man has more skills to show in cricket than any other game.",
                 "Hi there! how was your ladakh trip last month?",
                 "Isn’t cricket supposed to be a team sport? I feel people should decide first whether cricket is a team game or an individual sport."]
    words = ["cricket","sports","team","play","match"]
    #TODO: now this should return me the 2nd and last item from the Sentences list as the words list mostly matches with them
So from the above code I want to return the sentences that closely match the provided words. I don't want to use supervised machine learning here. Any help will be appreciated.
So finally I used the gensim library to generate the similarity scores.
import gensim
from nltk.tokenize import word_tokenize
def getSimilarityScore(raw_documents, words):
    gen_docs = [[w.lower() for w in word_tokenize(text)]
                for text in raw_documents]
    dictionary = gensim.corpora.Dictionary(gen_docs)
    corpus = [dictionary.doc2bow(gen_doc) for gen_doc in gen_docs]
    tf_idf = gensim.models.TfidfModel(corpus)
    sims = gensim.similarities.Similarity('/usr/workdir', tf_idf[corpus],
                                          num_features=len(dictionary))
    query_doc_bow = dictionary.doc2bow(words)
    query_doc_tf_idf = tf_idf[query_doc_bow]
    return sims[query_doc_tf_idf]
You can use this method as:
Sentences = ["This is the most beautiful place in the world.",
"This man has more skills to show in cricket than any other game.",
"Hi there! how was your ladakh trip last month?",
"Isn’t cricket supposed to be a team sport? I feel people should decide first whether cricket is a team game or an individual sport."]
words = ["cricket","sports","team","play","match"]
words_lower = [w.lower() for w in words]
getSimilarityScore(Sentences,words_lower)
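getSimilarityScore returns one similarity score per sentence, so picking the most relevant sentences is just a matter of sorting by those scores. A small sketch (the cut-off of 0 is an arbitrary choice):

scores = getSimilarityScore(Sentences, words_lower)

# pair each sentence with its score and keep the best matches first
ranked = sorted(zip(Sentences, scores), key=lambda pair: pair[1], reverse=True)
print([sentence for sentence, score in ranked if score > 0])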

How to remove stop words from documents in gensim?

I'm building an NLP chat application in Python using the Doc2Vec technique from the gensim package. I have already done tokenizing and stemming. I want to remove the stop words (to test whether it works better) from both the training set and the question the user asks.
Here is my code.
import gensim
import nltk
from gensim import models
from gensim import utils
from gensim import corpora
from nltk.stem import PorterStemmer
ps = PorterStemmer()
sentence0 = models.doc2vec.LabeledSentence(words=[u'sampl',u'what',u'is'],tags=["SENT_0"])
sentence1 = models.doc2vec.LabeledSentence(words=[u'sampl',u'tell',u'me',u'about'],tags=["SENT_1"])
sentence2 = models.doc2vec.LabeledSentence(words=[u'elig',u'what',u'is',u'my'],tags=["SENT_2"])
sentence3 = models.doc2vec.LabeledSentence(words=[u'limit', u'what',u'is',u'my'],tags=["SENT_3"])
sentence4 = models.doc2vec.LabeledSentence(words=[u'claim',u'how',u'much',u'can',u'I'],tags=["SENT_4"])
sentence5 = models.doc2vec.LabeledSentence(words=[u'retir',u'i',u'am',u'how',u'much',u'can',u'elig',u'claim'],tags=["SENT_5"])
sentence6 = models.doc2vec.LabeledSentence(words=[u'resign',u'i',u'have',u'how',u'much',u'can',u'i',u'claim',u'elig'],tags=["SENT_6"])
sentence7 = models.doc2vec.LabeledSentence(words=[u'promot',u'what',u'is',u'my',u'elig',u'post',u'my'],tags=["SENT_7"])
sentence8 = models.doc2vec.LabeledSentence(words=[u'claim',u'can,',u'i',u'for'],tags=["SENT_8"])
sentence9 = models.doc2vec.LabeledSentence(words=[u'product',u'coverag',u'cover',u'what',u'all',u'are'],tags=["SENT_9"])
sentence10 = models.doc2vec.LabeledSentence(words=[u'hotel',u'coverag',u'cover',u'what',u'all',u'are'],tags=["SENT_10"])
sentence11 = models.doc2vec.LabeledSentence(words=[u'onlin',u'product',u'can',u'i',u'for',u'bought',u'through',u'claim',u'sampl'],tags=["SENT_11"])
sentence12 = models.doc2vec.LabeledSentence(words=[u'reimburs',u'guidelin',u'where',u'do',u'i',u'apply',u'form',u'sampl'],tags=["SENT_12"])
sentence13 = models.doc2vec.LabeledSentence(words=[u'reimburs',u'procedur',u'rule',u'and',u'regul',u'what',u'is',u'the',u'for'],tags=["SENT_13"])
sentence14 = models.doc2vec.LabeledSentence(words=[u'can',u'i',u'submit',u'expenditur',u'on',u'behalf',u'of',u'my',u'friend',u'and',u'famili',u'claim',u'and',u'reimburs'],tags=["SENT_14"])
sentence15 = models.doc2vec.LabeledSentence(words=[u'invoic',u'bills',u'procedur',u'can',u'i',u'submit',u'from',u'shopper stop',u'claim'],tags=["SENT_15"])
sentence16 = models.doc2vec.LabeledSentence(words=[u'invoic',u'bills',u'can',u'i',u'submit',u'from',u'pantaloon',u'claim'],tags=["SENT_16"])
sentence17 = models.doc2vec.LabeledSentence(words=[u'invoic',u'procedur',u'can',u'i',u'submit',u'invoic',u'from',u'spencer',u'claim'],tags=["SENT_17"])
# User asks a question.
document = input("Ask a question:")
tokenized_document = list(gensim.utils.tokenize(document, lowercase = True, deacc = True))
#print(type(tokenized_document))
stemmed_document = []
for w in tokenized_document:
    stemmed_document.append(ps.stem(w))
sentence19 = models.doc2vec.LabeledSentence(words= stemmed_document, tags=["SENT_19"])
# Building vocab.
sentences = [sentence0,sentence1,sentence2,sentence3, sentence4, sentence5,sentence6, sentence7, sentence8, sentence9, sentence10, sentence11, sentence12, sentence13, sentence14, sentence15, sentence16, sentence17, sentence19]
#I tried to remove the stop words but it didn't work out as LabeledSentence object has no attribute lower.
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in sentences]
..
Is there a way I can remove stop words from sentences directly and get a new set of vocab without stop words ?
Your sentences object is already a list of LabeledSentence objects. You constructed these above; each includes a list-of-strings in words and a list-of-strings in tags.
So each item in that list (document in your list comprehension) can't have a string method like .lower() applied to it. (Nor would it need to be .split(), as its words are already separate tokens.)
The cleanest approach would be to remove stop-words from the lists-of-words before they're used to construct the LabeledSentence objects. For example, you could define a remove_stopwords() function at the top. Then your lines creating LabeledSentence objects could instead look like:
sentence0 = LabeledSentence(words=remove_stopwords([u'sampl', u'what', u'is']),
                            tags=["SENT_0"])
Alternatively, you could mutate the existing LabeledSentence objects so that each of their words attributes no longer contains stop-words. This would replace your last lines with something more like:
for doc in sentences:
    doc.words = [word for word in doc.words if word not in stoplist]
texts = sentences
Separately, things you didn't ask but should know:
TaggedDocument is now the preferred example-class name for Doc2Vec text objects – but in fact any object that has the two required properties words and tags will work fine.
Doc2Vec doesn't show many of the desired properties on tiny, toy-sized datasets – don't be surprised if a model built on dozens of sentences does not do anything useful, or misleads about what preprocessing/meta-parameter options are best. (Tens of thousands of texts, and texts at least tens-of-words long, are much better for meaningful results.)
Much Word2Vec/Doc2Vec work doesn't bother with stemming or stop-word removal, but it may sometimes be helpful.
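For example, a sketch of the first sentence above using the current class name:

from gensim.models.doc2vec import TaggedDocument

sentence0 = TaggedDocument(words=[u'sampl', u'what', u'is'], tags=["SENT_0"])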

How to cluster documents under topics using latent semantic analysis (lsa)

I've been working on latent semantic analysis (lsa) and applied this example: https://radimrehurek.com/gensim/tut2.html
It covers clustering terms under topics, but I couldn't find anything on how to cluster documents under topics.
In that example, it says that 'It appears that according to LSI, “trees”, “graph” and “minors” are all related words (and contribute the most to the direction of the first topic), while the second topic practically concerns itself with all the other words. As expected, the first five documents are more strongly related to the second topic while the remaining four documents to the first topic'.
How can we relate those five documents to their topics with Python code?
You can find my Python code below. I would appreciate any help.
from numpy import asarray
from gensim import corpora, models, similarities
#https://radimrehurek.com/gensim/tut2.html
documents = ["Human machine interface for lab abc computer applications",
"A survey of user opinion of computer system response time",
"The EPS user interface management system",
"System and human system engineering testing of EPS",
"Relation of user perceived response time to error measurement",
"The generation of random binary unordered trees",
"The intersection graph of paths in trees",
"Graph minors IV Widths of trees and well quasi ordering",
"Graph minors A survey"]
# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
for document in documents]
# remove words that appear only once
all_tokens = sum(texts, [])
tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) == 1)
texts = [[word for word in text if word not in tokens_once] for text in texts]
dictionary = corpora.Dictionary(texts)
corp = [dictionary.doc2bow(text) for text in texts]
tfidf = models.TfidfModel(corp) # step 1 -- initialize a model
corpus_tfidf = tfidf[corp]
# extract 2 LSI topics; use the default one-pass algorithm
lsi = models.lsimodel.LsiModel(corpus=corp, id2word=dictionary, num_topics=2)
corpus_lsi = lsi[corpus_tfidf]
for topic in lsi.print_topics(num_topics=2, num_words=10):
    print(topic)
for doc in corpus_lsi:  # both bow->tfidf and tfidf->lsi transformations are actually executed here, on the fly
    print(doc)
corpus_lsi is a list of 9 vectors, one per document.
Each vector stores at its i-th index how strongly that document is associated with topic i.
If you just want to assign each document to a single topic, choose the topic index with the highest value in its vector.
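A minimal sketch of that assignment, using the corpus_lsi from the question (taking the absolute value is an assumption, since LSI weights can be negative):

for doc_id, doc in enumerate(corpus_lsi):
    if not doc:  # a document can come back empty if it shares no vocabulary
        continue
    # doc is a list of (topic_id, weight) pairs; pick the dominant topic
    topic_id, weight = max(doc, key=lambda item: abs(item[1]))
    print("document %d -> topic %d (weight %.3f)" % (doc_id, topic_id, weight))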

Latent Dirichlet allocation (LDA) performance when limiting word length for corpus documents

I have been generating topics from the Yelp dataset of customer reviews using Latent Dirichlet allocation (LDA) in Python (gensim package). While generating tokens, I select only words of length >= 3 from the reviews (using RegexpTokenizer):
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w{3,}')
tokens = tokenizer.tokenize(review)
This filters out noisy words of length less than 3 while creating the corpus documents.
How will filtering out these words affect the performance of the LDA algorithm?
Generally speaking, for the English language, one- and two-letter words don't add information about the topic. If they don't add value, they should be removed during the pre-processing step. As with most algorithms, less data in means faster execution.
Words of fewer than three letters are mostly stop words anyway. LDA builds topics, so imagine you generate this topic:
[I, him, her, they, we, and, or, to]
compared to:
[shark, bull, greatwhite, hammerhead, whaleshark]
Which is more telling? This is why it is important to remove stopwords. This is how I do that:
import gensim
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Create functions to lemmatize, stem, and preprocess
# turn beautiful, beautifully, beautified into the stem beauti
def lemmatize_stemming(text):
    stemmer = PorterStemmer()
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

# parse docs into individual words, ignoring stopwords (him, her, them, for, there, etc.,
# since "their" is not a topic) and any token of 3 letters or fewer,
# then append the kept tokens to a list
def preprocess(text):
    result = []
    newStopWords = ['your_stopword1', 'your_stopword2']
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and token not in newStopWords and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result
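A quick usage sketch (the review string is made up for illustration):

review = "The waiters were friendly and the brunch menu was beautifully presented."
print(preprocess(review))
# keeps only the stemmed content words; the exact stems depend on the stemmer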
