I'm new to LDA and doing some experiments with Python + LDA on some sample datasets.
I already got some very interesting results, and now I have a question I couldn't find an answer to so far.
Since I worked with customer reviews/ratings of a certain app, the documents contain different topics (e.g. one review talks about the app's performance, price, and functionality). So as I understand it, I have three topics within one document.
My question: is LDA capable of assigning more than one topic to one document?
Thank you for your answer!
Yes, that is the whole idea of LDA: it infers a topic distribution for each document. So it will assign several topics to a document (unless the document is too short).
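As a quick illustration (the toy reviews and parameter values below are made up, but the gensim calls are the standard ones), you can ask a trained model for the per-document topic distribution:

from gensim import corpora, models

# Toy corpus: each review mixes several aspects (performance, price, functionality).
reviews = [
    "the app is slow and crashes but the price is fair",
    "great functionality and features well worth the price",
    "performance is terrible constant lag not worth the money",
]
texts = [r.split() for r in reviews]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

# num_topics=3 is just a guess for this toy example
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=3, passes=20, random_state=0)

# Each document gets a *distribution* over topics, not a single label
for i, bow in enumerate(corpus):
    print(i, lda.get_document_topics(bow, minimum_probability=0.05))
    # e.g. [(0, 0.62), (2, 0.35)] -> two topics for one review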
Related
I am trying to create a recommendation system for similar articles. I have a list of articles as a reference set, and I want the new articles that I acquire from a certain API to be similar to those reference articles.
One way I could do this is to just merge all of the reference articles into one big article, run cosine similarity, and get the list of articles that are similar to the merged reference article. Is there any other way I could implement cosine similarity?
Thanks
Have you looked into an NLP technique called Topic Modelling?
The following notebook from priya-dwivedi uses LDA (Latent Dirichlet Allocation):
https://github.com/priya-dwivedi/Deep-Learning/blob/master/topic_modeling/LDA_Newsgroup.ipynb
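If you would rather not merge the reference articles, another option (just a sketch with placeholder texts; it uses plain TF-IDF vectors via scikit-learn rather than the LDA approach from the notebook) is to score each new article against every reference article separately and then aggregate, e.g. take the max or the mean similarity:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

reference_articles = ["...reference text 1...", "...reference text 2..."]  # your curated list
candidate_articles = ["...new article A...", "...new article B..."]       # from the API

# Fit the vocabulary on everything so both sets live in the same vector space.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(reference_articles + candidate_articles)
ref_vecs = X[: len(reference_articles)]
cand_vecs = X[len(reference_articles):]

# similarity[i, j] = cosine similarity of candidate i to reference j
similarity = cosine_similarity(cand_vecs, ref_vecs)

# Aggregate per candidate: max = "similar to at least one reference",
# mean = "similar to the reference set as a whole".
scores = similarity.max(axis=1)
for idx in np.argsort(scores)[::-1]:
    print(candidate_articles[idx][:50], scores[idx])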
I am using gensim to do LDA on a corpus of arXiv abstracts in the category stats.ML
My problem is that there is a lot of overlap between the topics, whether I pick 5, 10, or 50 of them. Every topic's word distribution prominently features terms like "model", "algorithm", or "problem". How can the topics be considered distinct if so many of them prominently feature the same terms?
Using pyLDAvis was instructive for me. With the default relevance setting, the term distribution shown for topic #3 is dominated by those same generic words, but when I turn lambda down to 0.08, the actual nature of the topic emerges (ML in medical applications).
So my question is: how could I uncover these distinctive terms in the course of training my LDA model (without pyLDAvis)? And does the performance (as opposed to the interpretability) of the model improve if I can get it to ignore these common, non-discriminating terms?
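For context, my understanding is that pyLDAvis ranks terms by a "relevance" score, relevance(w, t) = lambda * log p(w|t) + (1 - lambda) * log(p(w|t) / p(w)), and this is roughly how I imagine computing it straight from the trained model (the formula and the way I estimate the marginal p(w) from corpus counts are my own assumptions):

import numpy as np

def relevant_terms(lda, corpus, dictionary, topic_id, lam=0.08, topn=15):
    """Rank terms for one topic by a pyLDAvis-style relevance score."""
    topic_term = lda.get_topics()        # shape (num_topics, vocab_size), rows sum to 1
    p_w_given_t = topic_term[topic_id]

    # Marginal p(w), estimated here from raw corpus counts
    counts = np.zeros(len(dictionary))
    for bow in corpus:
        for term_id, cnt in bow:
            counts[term_id] += cnt
    p_w = counts / counts.sum()

    eps = 1e-12
    relevance = lam * np.log(p_w_given_t + eps) \
        + (1 - lam) * np.log((p_w_given_t + eps) / (p_w + eps))

    top = np.argsort(relevance)[::-1][:topn]
    return [(dictionary[int(i)], relevance[i]) for i in top]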
I have a couple of ideas to try (a rough sketch of both follows this list), but would like more guidance:
Filtering the 50 most common terms from my dictionary. I think it helped a bit, but I'm not sure it's the right approach.
Tweaking the eta parameter in gensim.models.LdaModel.
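Concretely, this is what I mean by those two ideas (the thresholds are guesses, and texts stands in for my tokenized abstracts):

from gensim.corpora import Dictionary
from gensim.models import LdaModel

# texts = [["we", "propose", "a", "model", ...], ...]  # tokenized abstracts
dictionary = Dictionary(texts)

# Idea 1: drop the 50 most frequent tokens ("model", "algorithm", "problem", ...)
dictionary.filter_n_most_frequent(50)
# or, less blunt: drop tokens that appear in more than half of all abstracts
dictionary.filter_extremes(no_below=5, no_above=0.5)

corpus = [dictionary.doc2bow(t) for t in texts]

# Idea 2: learn asymmetric priors instead of flat ones
lda = LdaModel(corpus, id2word=dictionary, num_topics=10,
               alpha="auto", eta="auto", passes=10, random_state=0)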
My goal is ultimately to take a new document and do word coloring on it based on which words relate to which topics, and also offer the documents most similar to the input document.
I am pretty new to gensim, and this is my first SO question, so if I'm totally off base with something, please let me know ;-). Thank you
I am wondering if there is a way to use NLP (specifically the nltk module in Python) to find similarities between the subjects of different sentences. The problem is that the texts refer back to subjects introduced in a separate sentence and don't specifically refer to them by name (e.g. www.legaltips.org/Alabama/alabama_code/2-2-30.aspx). Any ideas or experience with this would be super helpful.
The short answer to your question is yes. :)
It sounds like the problem you are trying to solve is what we call anaphora or coreference resolution in NLP, although that only refers to tracking the same referent across different sentences. You can try getting started here: http://nlp.stanford.edu/software/dcoref.shtml
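For example (a rough, untested sketch: it assumes a local Stanford CoreNLP server running on port 9000 with the coref annotator available, and the exact JSON field names may vary between CoreNLP versions), you can query the server from Python and walk the coreference chains:

import json
import requests

TEXT = ("The statute was amended in 1975. It now requires the board "
        "to publish its rulings within thirty days.")

# Assumes the server was started with something like:
#   java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000
props = {"annotators": "tokenize,ssplit,pos,lemma,ner,parse,coref",
         "outputFormat": "json"}
resp = requests.post("http://localhost:9000/",
                     params={"properties": json.dumps(props)},
                     data=TEXT.encode("utf-8"))
ann = resp.json()

# Each entry in "corefs" is a chain of mentions that refer to the same entity,
# e.g. "The statute" ... "It" ... "its"
for chain_id, mentions in ann.get("corefs", {}).items():
    print(chain_id, [m["text"] for m in mentions])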
If you simply want to find similarities, then that is a different problem entirely. You should let people know what kind of similarities you are talking about (semantic, syntactic, etc.), and then you can get an answer (if that is your problem).
I have a list of questions in a text file, extracted from an online website. I am new to nltk (in Python) and am going through the initial chapters of the NLTK book (http://shop.oreilly.com/product/9780596516499.do). Could anybody help me categorize my questions under different headings?
I don't know the headings of the questions in advance. So how do I create the headings and then categorize the questions under them?
Your task consists of document clustering, where each question is a document, and cluster labeling, where each label designates a topic.
Note that if your questions are short and/or hard to separate, e.g. they belong to similar categories, the quality will not be very high.
Take a look at a simple recipe for document clustering, and at the related questions first and second.
As a baseline for labels, try max tf-idf words from cluster words or from centroids.
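A bare-bones version of that baseline (a sketch using scikit-learn; the example questions and the number of clusters are made up, and in practice you would tune k):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

questions = [
    "How do I reset my password?",
    "I forgot my password, what do I do?",
    "How much does the premium plan cost?",
    "Is there a discount for yearly billing?",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(questions)

k = 2  # number of headings; tune this in practice (elbow method, silhouette, ...)
km = KMeans(n_clusters=k, random_state=0, n_init=10).fit(X)

# Label each cluster with its top tf-idf terms taken from the centroid.
terms = vectorizer.get_feature_names_out()
order = km.cluster_centers_.argsort()[:, ::-1]
for c in range(k):
    heading = ", ".join(terms[i] for i in order[c, :3])
    members = [q for q, lbl in zip(questions, km.labels_) if lbl == c]
    print("Heading: " + heading)
    for q in members:
        print("  " + q)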
I am using Gensim to do some large-scale topic modeling. I am having difficulty understanding how to determine the predicted topics for an unseen (non-indexed) document. For example: I have 25 million documents which I have converted to vectors in LSA (and LDA) space. I now want to figure out the topics of a new document, let's call it x.
According to the Gensim documentation, I can use:
topics = lsi[doc(x)]
where doc(x) is a function that converts x into a vector.
The problem, however, is that the variable topics above comes back as a vector. The vector is useful if I am comparing x to additional documents, because it allows me to find the cosine similarity between them, but I am unable to actually return the specific words that are associated with x itself.
Am I missing something, or does Gensim not have this capability?
Thank you,
EDIT
Larsmans has the answer.
I was able to show the topics by using:
for t in topics:
    print lsi.show_topics(t[0])
The vector returned by [] on an LSI model is actually a list of (topic, weight) pairs. You can inspect a topic by means of the method LsiModel.show_topic.
I was able to show the topics by using:
for t in topics:
    print lsi.show_topics(t[0])
Just wanted to point out a tiny but important bug in your solution code: you need to use the show_topic() function rather than the show_topics() function.
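In other words, adapting your loop (show_topic takes the topic id, and optionally topn for how many words to return):

for t in topics:
    print(lsi.show_topic(t[0], topn=10))  # t is a (topic_id, weight) pair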
P.S. I know this should be posted as a comment rather than an answer, but my current reputation score does not allow comments just yet!