LSI Clustering using gensim in Python

I'm trying to run this example code in Python 2.7 for LSI text clustering.
import gensim
from gensim import corpora, models, similarities
documents = ["Human machine interface for lab abc computer applications",
"A survey of user opinion of computer system response time",
"The EPS user interface management system",
"System and human system engineering testing of EPS",
"Relation of user perceived response time to error measurement",
"The generation of random binary unordered trees",
"The intersection graph of paths in trees",
"Graph minors IV Widths of trees and well quasi ordering",
"Graph minors A survey"]
# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
for document in documents]
# remove words that appear only once
all_tokens = sum(texts, [])
tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) == 1)
texts = [[word for word in text if word not in tokens_once] for text in texts]
dictionary = corpora.Dictionary(texts)
corp = [dictionary.doc2bow(text) for text in texts]
# extract 400 LSI topics; use the default one-pass algorithm
lsi = gensim.models.lsimodel.LsiModel(corpus=corp, id2word=dictionary, num_topics=400)
# print the most contributing words (both positively and negatively) for each of the first ten topics
lsi.print_topics(10)
But it returns this error for the second-to-last command:
Warning (from warnings module):
File "/usr/local/lib/python2.7/dist-packages/numpy/core/fromnumeric.py", line 2499
VisibleDeprecationWarning)
VisibleDeprecationWarning: `rank` is deprecated; use the `ndim` attribute or function instead. To find the rank of a matrix see `numpy.linalg.matrix_rank`.
Please let me know what I am doing wrong or if I need to update anything to make it work.
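Note that the message above is only a numpy deprecation warning, not a fatal error, so the model is still built. A minimal sketch, assuming a gensim version in which print_topics() writes its output to the logging module rather than to stdout, is to enable logging or to print the return value of show_topics():
import logging
logging.basicConfig(format='%(levelname)s : %(message)s', level=logging.INFO)
lsi.print_topics(10)  # output now appears via the logger
for topic in lsi.show_topics(num_topics=10, num_words=10):
    print(topic)      # each entry is a formatted topic string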

Related

Difference between 2 kinds of text corpus vocabulary creations with spacy

I am trying to retrieve the vocabulary of a text corpus with spacy.
The corpus is a list of strings with each string being a document from the corpus.
I came up with 2 different methods to create the vocabulary. Both work but yield slightly different results, and I don't know why.
The first approach results in a vocabulary size of 5000:
words = nlp(" ".join(docs))
vocab2 = []
for word in words:
    if word.lemma_ not in vocab2 and word.is_alpha == True:
        vocab2.append(word.lemma_)
The second approach results in a vocabulary size of 5001 -> a single word more:
vocab = set()
for doc in docs:
    doc = nlp(doc)
    for token in doc:
        if token.is_alpha:
            vocab.add(token.lemma_)
Why do the 2 results differ?
My best guess would be that the model behind nlp() somehow interprets tokens differently when it gets the input as a whole vs. one document at a time.
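One way to see where the two vocabularies diverge is to build both as sets and print the symmetric difference; the sketch below assumes the same nlp and docs objects as in the question:
vocab_joined = {tok.lemma_ for tok in nlp(" ".join(docs)) if tok.is_alpha}
vocab_per_doc = set()
for doc in docs:
    vocab_per_doc.update(tok.lemma_ for tok in nlp(doc) if tok.is_alpha)
# the lemma(s) produced by only one of the two approaches
print(vocab_joined.symmetric_difference(vocab_per_doc))
Joining the documents into one string gives the tagger a single context with no boundaries between documents, which can change the part-of-speech tags and therefore the lemmas spaCy assigns near the joins.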

Meaning behind converting LDA topics to "suitable" TF-IDF matrices

As a beginner in text mining, I am trying to replicate the analyses from this paper. Essentially, the authors extract LDA topics (1-4) from a document and then "the topics extracted by LDA have been converted to suitable TF-IDF matrices that have been then used to predict..." (it is not important what they predict; it's a bunch of regressions). Their definition of TF and IDF (section 4.2.5) seems standard; my understanding, though, is that TF-IDF measures apply to a word in a document, not to topics. Given that they have a case where they extract a single topic, I think it's impossible to use the probability of the topic in a document, as this will always be 1 (though correct me if I am wrong).
So, what are the possible interpretations of converting LDA topics to "suitable TF-IDF" matrices? (and how would one go about doing that using the below code?)
Would that mean converting each and every word in a document to its TF-IDF weight and then using those weights in prediction? That does not seem plausible: with 1000+ documents the dimensionality would be very high, and almost certainly most of the features would be useless. (One possible reading is sketched after the example code below.)
Minimally reproducible example
(credit: Jordan Barber)
from nltk.tokenize import RegexpTokenizer
from stop_words import get_stop_words
from nltk.stem.porter import PorterStemmer
from gensim import corpora, models
import gensim
tokenizer = RegexpTokenizer(r'\w+')
# create English stop words list
en_stop = get_stop_words('en')
# Create p_stemmer of class PorterStemmer
p_stemmer = PorterStemmer()
# create sample documents
doc_a = "Brocolli is good to eat. My brother likes to eat good brocolli, but not my mother."
doc_b = "My mother spends a lot of time driving my brother around to baseball practice."
doc_c = "Some health experts suggest that driving may cause increased tension and blood pressure."
doc_d = "I often feel pressure to perform well at school, but my mother never seems to drive my brother to do better."
doc_e = "Health professionals say that brocolli is good for your health."
# compile sample documents into a list
doc_set = [doc_a, doc_b, doc_c, doc_d, doc_e]
# list for tokenized documents in loop
texts = []
# loop through document list
for i in doc_set:
    # clean and tokenize document string
    raw = i.lower()
    tokens = tokenizer.tokenize(raw)
    # remove stop words from tokens
    stopped_tokens = [i for i in tokens if not i in en_stop]
    # stem tokens
    stemmed_tokens = [p_stemmer.stem(i) for i in stopped_tokens]
    # add tokens to list
    texts.append(stemmed_tokens)
# turn our tokenized documents into a id <-> term dictionary
dictionary = corpora.Dictionary(texts)
# convert tokenized documents into a document-term matrix
corpus = [dictionary.doc2bow(text) for text in texts]
# generate LDA model
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=2, id2word = dictionary, passes=20)
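One possible reading, sketched here as an assumption rather than as the paper's actual method, is to use each document's topic proportions as the numeric features for the regressions. With the model above, that documents x topics matrix can be built like this:
import numpy as np
# documents x topics matrix of topic proportions (one interpretation, not
# necessarily what the paper's authors did)
doc_topic = np.zeros((len(corpus), ldamodel.num_topics))
for d, bow in enumerate(corpus):
    for topic_id, prob in ldamodel.get_document_topics(bow, minimum_probability=0.0):
        doc_topic[d, topic_id] = prob
print(doc_topic)  # each row sums to roughly 1 and serves as that document's feature vector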

How to cluster documents under topics using latent semantic analysis (lsa)

I've been working on latent semantic analysis (lsa) and applied this example: https://radimrehurek.com/gensim/tut2.html
It covers clustering terms under topics, but I couldn't find anything on how to cluster documents under topics.
In that example, it says that 'It appears that according to LSI, “trees”, “graph” and “minors” are all related words (and contribute the most to the direction of the first topic), while the second topic practically concerns itself with all the other words. As expected, the first five documents are more strongly related to the second topic while the remaining four documents to the first topic'.
How can we relate those five documents with Python code to the related topic?
You can find my python code below. I would appreciate any help.
from numpy import asarray
from gensim import corpora, models, similarities
#https://radimrehurek.com/gensim/tut2.html
documents = ["Human machine interface for lab abc computer applications",
"A survey of user opinion of computer system response time",
"The EPS user interface management system",
"System and human system engineering testing of EPS",
"Relation of user perceived response time to error measurement",
"The generation of random binary unordered trees",
"The intersection graph of paths in trees",
"Graph minors IV Widths of trees and well quasi ordering",
"Graph minors A survey"]
# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
for document in documents]
# remove words that appear only once
all_tokens = sum(texts, [])
tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) == 1)
texts = [[word for word in text if word not in tokens_once] for text in texts]
dictionary = corpora.Dictionary(texts)
corp = [dictionary.doc2bow(text) for text in texts]
tfidf = models.TfidfModel(corp) # step 1 -- initialize a model
corpus_tfidf = tfidf[corp]
# extract 2 LSI topics; use the default one-pass algorithm
lsi = models.lsimodel.LsiModel(corpus=corp, id2word=dictionary, num_topics=2)
corpus_lsi = lsi[corpus_tfidf]
#for i in range(0, lsi.num_topics-1):
for i in range(0, 3):
    print lsi.print_topics(i)
for doc in corpus_lsi: # both bow->tfidf and tfidf->lsi transformations are actually executed here, on the fly
    print(doc)
corpus_lsi is a list of 9 vectors, one per document.
Each vector stores at its i-th index how strongly that document relates to topic i.
If you just want to assign a document to one topic, choose the topic index with the highest value in its vector.
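A minimal sketch of that assignment, assuming the corpus_lsi from the code above; a common choice is the largest absolute weight, since LSI weights can be negative:
for doc_id, doc in enumerate(corpus_lsi):
    # pick the topic whose weight has the largest magnitude for this document
    best_topic = max(doc, key=lambda item: abs(item[1]))[0]
    print('%d -> topic %d' % (doc_id, best_topic))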

Deleting words from text list

I am trying to remove certain words (in addition to using stopwords) from a list of text strings, but it is not working for some reason:
documents = ["Human machine interface for lab abc computer applications",
"A survey of user opinion of computer system response time",
"The EPS user interface management system",
"System and human system engineering testing of EPS",
"Relation of user perceived response time to error measurement",
"The generation of random binary unordered trees",
"The intersection graph of paths in trees",
"Graph minors IV Widths of trees and well quasi ordering",
"Graph minors A survey"]
exclude = ['am', 'there','here', 'for', 'of', 'user']
new_doc = [word for word in documents if word not in exclude]
print new_doc
OUTPUT
['Human machine interface for lab abc computer applications', 'A survey of user opinion of computer system response time', 'The EPS user interface management system', 'System and human system engineering testing of EPS', 'Relation of user perceived response time to error measurement', 'The generation of random binary unordered trees', 'The intersection graph of paths in trees', 'Graph minors IV Widths of trees and well quasi ordering', 'Graph minors A survey']
As you can see, none of the words in exclude are removed from documents ("for" is a prime example).
It works with this expression:
new_doc = [word for word in str(documents).split() if word not in exclude]
but then how do I get back the initial elements (albeit "cleaned ones") in DOCUMENTS?
I will greatly appreciate your help!
You should split the lines into words before filtering them:
new_doc = [' '.join([word for word in line.split() if word not in exclude]) for line in documents]
You are looping over the sentences, not the words. Instead, split each sentence, use a nested loop to filter the words, and then join the result.
>>> new_doc = [' '.join([word for word in sent.split() if word not in exclude]) for sent in documents]
>>>
>>> new_doc
['Human machine interface lab abc computer applications', 'A survey opinion computer system response time', 'The EPS interface management system', 'System and human system engineering testing EPS', 'Relation perceived response time to error measurement', 'The generation random binary unordered trees', 'The intersection graph paths in trees', 'Graph minors IV Widths trees and well quasi ordering', 'Graph minors A survey']
>>>
Alternatively, instead of splitting and filtering in a nested list comprehension, you can use a regex to replace the excluded words with an empty string via the re.sub function:
>>> import re
>>>
>>> new_doc = [re.sub(r'|'.join(exclude),'',sent) for sent in documents]
>>> new_doc
['Human machine interface lab abc computer applications', 'A survey opinion computer system response time', 'The EPS interface management system', 'System and human system engineering testing EPS', 'Relation perceived response time to error measurement', 'The generation random binary unordered trees', 'The intersection graph paths in trees', 'Graph minors IV Widths trees and well quasi ordering', 'Graph minors A survey']
>>>
r'|'.join(exclude) concatenates the words with a pipe character (which means logical OR in regex).
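One caveat with this regex (an addition of mine, not part of the original answer): without word boundaries the pattern can also match inside longer words, and it leaves extra spaces behind. A slightly safer variant anchors the excluded words with \b and collapses the whitespace afterwards:
import re
# escape each word, join with | and require word boundaries on both sides
pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, exclude)) + r')\b')
new_doc = [' '.join(pattern.sub('', sent).split()) for sent in documents]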

How to print out the full distribution of words in an LDA topic in gensim?

The lda.show_topics call in the following code only prints the distribution of the top 10 words for each topic. How do I print the full distribution over all the words in the corpus?
from gensim import corpora, models
documents = ["Human machine interface for lab abc computer applications",
"A survey of user opinion of computer system response time",
"The EPS user interface management system",
"System and human system engineering testing of EPS",
"Relation of user perceived response time to error measurement",
"The generation of random binary unordered trees",
"The intersection graph of paths in trees",
"Graph minors IV Widths of trees and well quasi ordering",
"Graph minors A survey"]
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
for document in documents]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
lda = models.ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=2)
for i in lda.show_topics():
    print i
There is a parameter called topn in show_topics() where you can specify the number of top N words you want from the word distribution of each topic; see http://radimrehurek.com/gensim/models/ldamodel.html
So instead of the default lda.show_topics(), you can pass topn=len(dictionary) to get the full word distribution for each topic:
for i in lda.show_topics(topn=len(dictionary)):
    print i
There are two parameters, num_topics and num_words, in show_topics(): for num_topics topics, it returns the num_words most significant words (10 words per topic, by default); see http://radimrehurek.com/gensim/models/ldamodel.html#gensim.models.ldamodel.LdaModel.show_topics
So you can use len(lda.id2word) for the full word distribution of each topic, and lda.num_topics for all the topics in your LDA model.
for i in lda.show_topics(formatted=False,num_topics=lda.num_topics,num_words=len(lda.id2word)):
    print i
The code below prints the words as well as their probabilities. It prints the top 10 words per topic; change num_words=10 to print more.
for words in lda.show_topics(formatted=False,num_words=10):
    print(words[0])
    print("******************************")
    for word_prob in words[1]:
        print("(",dictionary[int(word_prob[0])],",",word_prob[1],")",end = "")
    print("")
    print("******************************")
