Finding topics of an unseen document via Gensim - python

I am using Gensim to do some large-scale topic modeling. I am having difficulty understanding how to determine the predicted topics for an unseen (non-indexed) document. For example, I have 25 million documents which I have converted to vectors in LSA (and LDA) space. I now want to figure out the topics of a new document; let's call it x.
According to the Gensim documentation, I can use:
topics = lsi[doc(x)]
where doc(x) is a function that converts x into a vector.
The problem is, however, that the above variable, topics, returns a vector. The vector is useful if I am comparing x to additional documents because it allows me to find the cosine similarity between them, but I am unable to actually return specific words that are associated with x itself.
Am I missing something, or does Gensim not have this capability?
Thank you,
EDIT
Larsmans has the answer.
I was able to show the topics by using:
for t in topics:
    print lsi.show_topics(t[0])

The vector returned by [] on an LSI model is actually a list of (topic, weight) pairs. You can inspect a topic by means of the method LsiModel.show_topic.
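For illustration, a minimal sketch of that workflow (my own example, not from the answer; `dictionary`, `lsi`, and `new_document_tokens` stand in for an existing gensim Dictionary, a trained LsiModel, and a tokenized unseen document):

# convert the unseen document to bag-of-words, then query the LSI model
bow = dictionary.doc2bow(new_document_tokens)
topics = lsi[bow]                                      # [(topic_id, weight), ...]
for topic_id, weight in topics:
    print(weight, lsi.show_topic(topic_id, topn=10))   # top words for this topic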

I was able to show the topics by using:
for t in topics:
    print lsi.show_topics(t[0])
Just wanted to point out a tiny but important bug in your solution code: you need to use the show_topic() function rather than the show_topics() function.
P.S. I know this should be posted as a comment rather than an answer, but my current reputation score does not allow comments just yet!

Related

Documents in training data that belong to a particular topic in LDA

I am working on a problem where I have text data with around 10,000 documents. I have created an app where, if a user enters some random comment, it should display all the similar comments/documents present in the training data.
Exactly like on Stack Overflow: if you ask a question, it shows all related questions asked earlier.
So if anyone has any suggestions on how to do this, please answer.
Second, I am trying the LDA (Latent Dirichlet Allocation) algorithm, where I can get the topic to which my new document belongs, but how will I get the similar documents from the training data? Also, how should I choose num_topics in LDA?
If anyone has any suggestions for algorithms other than LDA, please tell me.
You can try the following:
Doc2vec - this is an extension of the extremely popular word2vec algorithm, which maps words to an N-dimensional vector space such that words that occur in close proximity in your documents end up in close proximity in the vector space. You can use pre-trained word embeddings; learn more about word2vec here. Doc2vec extends word2vec to map each document to a vector of dimension N. After this, you can use any distance measure to find the documents most similar to an input document (a minimal sketch follows below).
Word Mover's Distance - this is directly suited to your purpose and also uses word embeddings. I have used it in one of my personal projects and achieved really good results. Find out more about it here.
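Here is a minimal, hedged sketch of the Doc2vec route mentioned above (my own illustration, not from the original answer; the toy corpus and parameter values are placeholders):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# toy corpus: in practice, use your ~10,000 cleaned, tokenized documents
corpus = [["add", "flour", "and", "sugar"],
          ["mix", "butter", "with", "sugar"],
          ["bake", "at", "180", "degrees"]]
tagged = [TaggedDocument(words=tokens, tags=[i]) for i, tokens in enumerate(corpus)]

# tiny sizes for the toy example; tune vector_size/epochs on real data
model = Doc2Vec(tagged, vector_size=50, window=3, min_count=1, epochs=40)

# map a new user comment into the same space and rank training docs by similarity
new_comment = ["mix", "sugar", "and", "butter"]
vec = model.infer_vector(new_comment)
print(model.dv.most_similar([vec], topn=3))   # model.docvecs in gensim < 4.0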
Also, make sure you apply appropriate text cleaning before applying the algorithms: steps like case normalization, stopword removal, punctuation removal, etc. It really depends on your dataset. Find out more here.
I hope this was helpful...

Which clustering method is the standard way to go for text analytics?

Assume you have a lot of text sentences which may (or may not) have similarities. Now you want to cluster similar sentences to find the centroid of each cluster. Which method is the preferred way of doing this kind of clustering? K-means with TF-IDF sounds promising. Nevertheless, are there more sophisticated or better algorithms? The data is tokenized and in a one-hot encoded format.
Basically, you can cluster texts using different techniques. As you pointed out, K-means with TF-IDF is one way to do this. Unfortunately, TF-IDF alone won't be able to "detect" semantics and project semantically similar texts near one another in the space. However, instead of TF-IDF you can use word embeddings, such as word2vec or GloVe - there is a lot of information about them on the net, just google it.
Have you ever heard of topic models? Latent Dirichlet allocation (LDA) is a topic model that treats each document as a mixture of a small number of topics and attributes each word's presence to one of the document's topics (see the Wikipedia link). So, using a topic model you can also do some kind of grouping and assign similar texts (with a similar topic) to groups. I recommend reading about topic models, since they are more common for problems connected with text clustering.
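For concreteness, a minimal sketch of the K-means-with-TF-IDF baseline discussed above (my own illustration; the toy sentences and n_clusters=2 are placeholders):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# toy sentences; replace with your own corpus (tokens joined back into strings)
sentences = ["the cat sat on the mat",
             "cats purr when they are happy",
             "I love my dog",
             "dogs are loyal pets"]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(sentences)

km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)

print(labels)                      # cluster assignment for each sentence
print(km.cluster_centers_.shape)   # one centroid per cluster in TF-IDF space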
I hope my answer was helpful.
In my view, you can use LDA (latent Dirichlet allocation); it is more flexible than other clustering techniques because its alpha and beta vectors can adjust the contribution of each topic to a document and of each word to a topic. It can help if the documents are not of similar length or quality.

How to get similar words from a custom input dictionary of word to vectors in gensim

I am working on a document similarity problem. For each document, I retrieve the vectors for each of its words (from a pre-trained word embedding model) and average them to get the document vector. I end up having a dictionary (say, my_dict) that maps each document in my collection to its vector.
I want to feed this dictionary to gensim and for each document, get other documents in 'my_dict' that are closer to it. How could I do that?
You might want to consider rephrasing your question (from the title, you are looking for word similarity; from the description, I gather you want document similarity) and adding a little more detail to the description. Without more detailed info about what you want and what you have tried, it is difficult to help you achieve it, because you could want to do a whole bunch of different things. That being said, I think I can help you out generally, even without knowing what you want gensim to do. gensim is quite powerful and offers lots of different functionality.
Assuming your dictionary is already in gensim format, you can load it like this:
from gensim import corpora
dictionary = corpora.Dictionary.load('my_dict.dict')
There - now you can use it with gensim, and run analyses and models to your heart's desire. For similarities between words you can play around with pre-made functions such as most_similar() on a trained word2vec model.
For document similarity with a trained LDA model, see this stackoverflow question.
For a more detailed explanation, see this gensim tutorial, which uses cosine similarity as the measure of similarity between documents.
gensim has a bunch of pre-made functionality which does not require LDA, for example gensim.similarities.MatrixSimilarity from similarities.docsim; I would recommend looking at the documentation and examples.
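If averaged document vectors are what you already have, one rough way to wire them into MatrixSimilarity is sketched below (my own illustration, not from the answer; `my_dict` follows the question's naming and the random vectors are placeholders):

import numpy as np
from gensim import similarities

# toy stand-in for the question's my_dict: document id -> averaged word vector
my_dict = {"doc_a": np.random.rand(100),
           "doc_b": np.random.rand(100),
           "doc_c": np.random.rand(100)}

doc_ids = list(my_dict.keys())
vectors = np.vstack([my_dict[d] for d in doc_ids]).astype(np.float32)
# unit-normalise so the dot products computed by MatrixSimilarity are cosine similarities
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

index = similarities.MatrixSimilarity(vectors, num_features=vectors.shape[1])

query = my_dict["doc_a"].astype(np.float32)
query /= np.linalg.norm(query)
sims = index[query]                                   # one similarity per document
ranked = sorted(zip(doc_ids, sims), key=lambda x: -x[1])
print(ranked)                                         # doc_a itself should rank first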
Also, in order to avoid a bunch of pitfalls: is there a specific reason to average the vectors yourself (or even to average them at all)? You do not need to do this (gensim has more sophisticated methods that map documents to vectors for you, like models.doc2vec), and you might lose valuable information.

custom feature function with (python) crfsuite

I have only read theory about CRFs so far and want to use python-crfsuite in my master's thesis for extracting ingredients from recipes. Any help is appreciated.
As far as I understand, I can provide training data to crfsuite in the form of the picture below, where w[0] provides the identity of the current word, w[i] the word at offset i relative to it, and pos[i] the part-of-speech tag at offset i.
And then crfsuite trains its own feature functions built on the given attributes.
But I can't find a way to provide custom feature functions like "w[i] is in a dictionary" (for example, a dictionary of recipe ingredients) or "the sentence contains a negation" (for example "not" or "don't").
In general, good tutorials are appreciated, because the manuals (https://python-crfsuite.readthedocs.io/en/latest/ or http://www.chokkan.org/software/crfsuite/manual.html) are not beginner-friendly from my point of view.
With python-crfsuite (or sklearn-crfsuite) training data doesn't have to be in the form you've described; a single training sequence should be a list of {"feature_name": <feature_value>} dicts, with features for each sequence element (e.g. for a token in a sentence). Features don't have to be words or POS tags. There are a few other supported feature formats (see http://python-crfsuite.readthedocs.io/en/latest).
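To make that concrete, here is a rough sketch of what such custom features could look like (my own illustration; INGREDIENTS, NEGATIONS, and token2features are hypothetical names, not part of the library):

INGREDIENTS = {"flour", "sugar", "butter"}           # your dictionary of recipe ingredients
NEGATIONS = {"not", "no", "never", "don't"}

def token2features(sent, i):
    word = sent[i]
    return {
        "w[0]": word.lower(),                                    # identity of the current word
        "w[0].in_ingredient_dict": word.lower() in INGREDIENTS,  # custom dictionary-lookup feature
        "sent.has_negation": any(t.lower() in NEGATIONS for t in sent),
        "w[-1]": sent[i - 1].lower() if i > 0 else "BOS",        # previous word, or beginning-of-sentence
    }

sentence = ["Add", "two", "cups", "of", "flour"]
X = [[token2features(sentence, i) for i in range(len(sentence))]]
y = [["O", "O", "O", "O", "INGREDIENT"]]
# X and y can then be fed to pycrfsuite.Trainer (via trainer.append) or sklearn_crfsuite.CRF.fit.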
For a more complete example check https://github.com/TeamHG-Memex/sklearn-crfsuite/blob/master/docs/CoNLL2002.ipynb - it uses custom features.

Validating document classification procedure using scikit-learn and NLTK (python 3.4) yielding awkward MDS stress

This is my first post on SO so I hope I'm not committing any posting crimes just yet ;-). This is verbose because part of what I am trying to do is to validate my process and ensure I understand how this is done without screwing up majorly. I will sum up my questions here:
How can I have a stress value in the 50s from an MDS? I thought it should be between 0 and 1.
Is running a clustering function on coordinates obtained through MDS a big no-no? I ask because my results do not change significantly when doing so, but that could just be because of my data.
I want to validate my k value for the number of clusters using an "elbow" method. How can I compute this, given that I rely on linkage() and fcluster(), to plot the number of clusters against an error value? Any help on methods or calls to access that data, or on what data I need to compute it, would be greatly appreciated.
I am working on a document classification scheme using Python 3.4 for a pet project I have, where I want to feed in a corpus of several thousand texts and classify them using hierarchical clustering. I would also like to use MDS to graphically represent the cluster structures (I will also use a dendrogram, but want to give this a shot).
Anyway, first thing I want to do is validate my procedure to make sure I understand how this works. This is done using NLTK and scikit-learn. My objective is not to call one procedure in scikit-learn that would do everything. Rather, I want to compute my similarity matrix (using a procedure in NLTK for example) and then feed that into a clustering function, using the precomputed parameter in some of the methods I rely on.
So my steps are currently as follows:
Load corpus.
Clean up corpus items: remove stop words and unwanted chars (numerical values and other text that is not relevant to my objective); use lemmatization (WordNet). The end result is a matrix with n documents and m terms.
Compute the similarity between documents: for each document, compute cosine similarity against the matrix of terms. To do that, I use TfidfVectorizer.
Note: I am a python newbie so I may not do things in a pythonic way. I apologize in advance...
vectorizer = TfidfVectorizer(tokenizer = tokenize, preprocessor = preprocess)
sparse_matrix = vectorizer.fit_transform(term_dict.values())
The tokenizer and preprocessor methods are dummy methods I had to add so that it would not try and tokenize etc. my dictionary which was previously built.
The cosine similarity matrix is built using:
from scipy.sparse import coo_matrix
from sklearn.metrics.pairwise import cosine_similarity

# pairwise cosine similarities, filled in one document (row) at a time
return_matrix = [[0 for x in range(len(document_terms_list))] for x in range(len(document_terms_list))]
for i in range(len(document_terms_list)):
    similarity = cosine_similarity(sparse_matrix[i:i+1], sparse_matrix)
    M = coo_matrix(similarity)
    for k, j, v in zip(M.row, M.col, M.data):
        return_matrix[i][j] = v
So for 100 documents, return_matrix is basically 100 x 100 with each cell having a similarity between Doc_x and Doc_y.
My next step is to perform the clustering (I want to use complete linkage via scipy's hierarchical clustering).
To reduce dimensionality and be able to visualize results, I first perform an MDS on the data:
mds = manifold.MDS(n_components = 2, dissimilarity = "precomputed", verbose = 1)
results = mds.fit(return_matrix)
coordinates = results.embedding_
My problem arises here: calling mds.stress_ reports a value of about 53. I was under the impression that my stress value should be somewhere between 0 and 1. Ahem, needless to say I am speechless at this... This would be my first question. When I print the similarity matrix etc., everything looks relatively good...
To build the clusters, I am currently passing in coordinates to the linkage() and fcluster() functions, i.e. I am passing in the MDS'ed version of my similarity matrix. Now, I wonder if this could be an issue although the results look ok when I look at the clusters assigned to my data. But conceptually, I am not sure this makes sense.
In trying to determine an ideal number of clusters, I want to use an "elbow" method, plotting the variance explained against the number of clusters to find an "ideal" cutoff. I am not sure I see this anywhere in the scikit-learn docs and tutorials. I see places where people do it in R etc., but when I use hierarchical clustering, how can I achieve this? I just don't know where to get the data from the API, or what data I am looking for exactly.
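For what it's worth, a rough sketch of one place elbow-style data can come from with scipy's hierarchical clustering (my own illustration, not an answer from the thread; it assumes the merge heights in the linkage matrix are a usable proxy for the error value, and `coordinates` stands in for the MDS output above):

import numpy as np
from scipy.cluster.hierarchy import linkage

coordinates = np.random.rand(20, 2)   # stand-in for results.embedding_ from the MDS step above
Z = linkage(coordinates, method="complete")

# Z[:, 2] holds the distance at which each merge happened; reading it from the
# end gives the height at which k clusters collapse into k-1
merge_distances = Z[::-1, 2]
for k, d in enumerate(merge_distances[:10], start=2):
    print(k, d)   # plot k against d and look for the "elbow"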
Many thanks in advance. I apologize for the length of this post but I figured giving out some context might help.
Cheers,
Greg
