I am trying to get unique words for each topic.
I am using gensim, and this is the line that generates my model:
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=2, id2word = dictionary)
But I get the same words repeated across the two topics; I would like each topic to have distinct words.
You cannot enforce word uniqueness per topic in LDA, since each topic is a distribution over all the words in the vocabulary. This distribution measures the probability that words co-occur inside a topic. Thus, nothing ensures that a word won't co-occur with different words in different contexts, which leads to the same word being represented in different topics.
Let's take an example by considering these two documents:
doc1: The python is a beautiful snake living in the forest.
doc2: Python is a beautiful language used by programmer and data scientist.
In doc1, the word python co-occurs with snake, forest and living, which might give this word a high probability of appearing in a topic about, let's say, biology.
In doc2, the word python co-occurs with language, programmer and data, which, in this case, will associate the word with a topic about computer science.
What you can do instead is look at the words with the highest probability in each topic in order to achieve what you want.
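A minimal sketch of that idea, with hard-coded topic-word probabilities standing in for what a trained model would give you (in gensim you would read these from ldamodel.show_topic(topic_id)):

```python
# Toy topic-word distributions; in gensim these would come from
# ldamodel.show_topic(topic_id, topn=...), not be hard-coded like this.
topics = {
    0: {"python": 0.04, "snake": 0.03, "forest": 0.02, "language": 0.01},
    1: {"python": 0.06, "language": 0.05, "programmer": 0.03, "snake": 0.005},
}

def top_words(topic, n=3):
    """Return the n highest-probability words for one topic."""
    return [w for w, _ in sorted(topic.items(), key=lambda kv: -kv[1])[:n]]

for tid, dist in topics.items():
    print(tid, top_words(dist))
```

Note that "python" still appears at the top of both topics, which is exactly the behaviour described above.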
Words that are grouped into one topic are not necessarily semantically similar (i.e. close in a space mapped from word2vec). They just co-occur more often.
Related
I am using Gensim Phrases to identify important n-grams in my text as follows.
from gensim.models import Phrases

bigram = Phrases(documents, min_count=5)
trigram = Phrases(bigram[documents], min_count=5)

for sent in documents:
    bigrams_ = bigram[sent]
    trigrams_ = trigram[bigram[sent]]
However, this detects uninteresting n-grams such as special issue, important matter, high risk, etc. I am particularly interested in detecting concepts in the text, such as machine learning, human computer interaction, etc.
Is there a way to stop Phrases from detecting uninteresting n-grams like the ones in my example above?
Phrases has a configurable threshold parameter which adjusts the statistical cutoff for promoting word-pairs into phrases. (Larger thresholds mean fewer pairs become phrases.)
You can adjust that to try to make a greater proportion of its promoted phrases match your own ad hoc intuition about "interesting" phrases – but this class is still using a fairly crude method, without any awareness of grammar or domain knowledge beyond what's in the corpus. So any value that gets all/most of the phrases you want will likely include many uninteresting ones, or vice-versa.
If you have a priori knowledge that certain word-groups are of importance, you could preprocess the corpus yourself to combine those into single tokens, before (or instead of) the collocation-statistics-based Phrases process.
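That preprocessing step could look something like the sketch below; KNOWN_PHRASES is a hypothetical list you would curate yourself from your domain knowledge:

```python
# Merge a priori known word-groups into single tokens before (or instead of)
# the collocation-statistics-based Phrases step.
KNOWN_PHRASES = [("machine", "learning"), ("human", "computer", "interaction")]

def merge_known_phrases(tokens, phrases=KNOWN_PHRASES):
    out, i = [], 0
    while i < len(tokens):
        for p in phrases:
            if tuple(tokens[i:i + len(p)]) == p:
                out.append("_".join(p))  # e.g. "machine_learning"
                i += len(p)
                break
        else:
            out.append(tokens[i])
            i += 1
    return out

print(merge_known_phrases("we study machine learning daily".split()))
```

The merged tokens then pass through the rest of the pipeline (dictionary, LDA, etc.) as ordinary words.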
If I understand what you're trying to do, you could try comparing the TF-IDF of your corpus against the TF-IDF of a larger, standard corpus (Wikipedia or something similar).
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=500, min_df=0.2,
                                   stop_words='english', use_idf=True,
                                   ngram_range=(1, 2))
X = tfidf_vectorizer.fit_transform(docs)  # fit, then map the documents to their TF-IDF vectors
Look only at the n-grams whose values differ greatly between the two corpora; of course, this will only work if you have a large enough number of documents.
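A stripped-down sketch of the comparison idea, using plain relative term frequencies instead of the full sklearn pipeline (the two corpora here are toy stand-ins for your domain corpus and the background corpus):

```python
from collections import Counter

def rel_freq(tokens):
    """Relative frequency of each term in a token list."""
    counts = Counter(tokens)
    total = len(tokens)
    return {w: c / total for w, c in counts.items()}

domain = "cricket bat cricket ball match cricket".split()
background = "the a match of the day the a the".split()

dom_f, bg_f = rel_freq(domain), rel_freq(background)
# Terms far more frequent in the domain corpus than in the background
# corpus are candidate domain-specific terms.
distinctive = {w: dom_f[w] / bg_f.get(w, 1e-6) for w in dom_f}
top = sorted(distinctive, key=distinctive.get, reverse=True)[:3]
print(top)
```

With real data you would use the fitted TF-IDF values rather than raw frequencies, but the ranking logic is the same.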
I am working on a language that is not English, and I have scraped the data from different sources. I have done my preprocessing: punctuation removal, stop-word removal and tokenization. Now I want to extract domain-specific lexicons. Let's say that I have data related to sports, entertainment, etc., and I want to extract words that are related to these particular fields, like cricket, and place them in closely related topics. I tried to use LDA for this, but I am not getting the correct clusters. Also, a word that is part of one topic also appears in other topics.
How can I improve my results?
# URDU STOP WORDS REMOVAL
doc_clean = []
stopwords_corpus = UrduCorpusReader('./data', ['stopwords-ur.txt'])
stopwords = stopwords_corpus.words()
# print(stopwords)

# wordlists: the corpus reader over the document files (defined elsewhere)
for infile in wordlists.fileids():
    words = wordlists.words(infile)
    # print(words)
    finalized_words = remove_urdu_stopwords(stopwords, words)
    doc_clean.append(finalized_words)  # append returns None, so don't assign its result
    print("\n==== WITHOUT STOPWORDS ===========\n")
    print(finalized_words)

# make dictionary and corpus
dictionary = corpora.Dictionary(doc_clean)
# convert tokenized documents into a document-term matrix
matrix = [dictionary.doc2bow(text) for text in doc_clean]

# generate LDA model
lda = models.ldamodel.LdaModel(corpus=matrix, id2word=dictionary, num_topics=5, passes=10)
for top in lda.print_topics():
    print("\n=== topics from files ===\n")
    print(top)
LDA and its drawbacks: The idea of LDA is to uncover latent topics in your corpus. A drawback of this unsupervised machine-learning approach is that you will end up with topics that may be hard for humans to interpret. Another drawback is that you will most likely end up with some generic topics, including words that appear in every document (like 'introduction', 'date', 'author', etc.). Thirdly, you will not be able to uncover latent topics that simply are not present enough: if you have only one article about cricket, it will not be recognised by the algorithm.
Why LDA doesn't fit your case:
You are searching for explicit topics like cricket, and you want to learn something about cricket vocabulary, correct? However, LDA will output some topics, and you need to recognise cricket vocabulary in order to determine that, e.g., topic 5 is concerned with cricket. Oftentimes, LDA will identify topics that are mixed with other related topics. Keeping this in mind, there are three scenarios:
You don't know anything about cricket, but you are able to identify the topic that's concerned with cricket.
You are a cricket expert and already know the cricket vocabulary.
You don't know anything about cricket and are not able to identify the semantic topic that the LDA produced.
In the first case, you will have the problem that you are likely to associate words with cricket that are actually not related to cricket, because you count on the LDA output to provide high-quality topics that are only concerned with cricket and no other related topics or generic terms. In the second case, you don't need the analysis in the first place, because you already know the cricket vocabulary! The third case is likely when you are relying on your computer to interpret the topics. However, in LDA you always rely on humans to give a semantic interpretation of the output.
So what to do: There's a paper called Targeted Topic Modeling for Focused Analysis (Wang 2016), which tries to identify which documents are concerned with a pre-defined topic (like cricket). If you have a list of topics for which you'd like to get some topic-specific vocabulary (cricket, basketball, romantic comedies, ..), a starting point could be to first identify relevant documents to then proceed and analyse the word-distributions of the documents related to a certain topic.
Note that perhaps there are completely different methods that will perform exactly what you're looking for. If you want to stay in the LDA-related literature, I'm relatively confident that the article I linked is your best shot.
Edit:
If this answer is useful to you, you may find my paper interesting, too. It takes a labeled dataset of academic economics papers (600+ possible labels) and tries various LDA flavours to get the best predictions on new academic papers. The repo contains my code, documentation, and the paper itself.
I'm using the LDA algorithm from the gensim package to find topics in a given text.
I've been asked to make the resulting topics include different words for each topic, e.g. if topic A has the word 'monkey' in it, then no other topic should include 'monkey' in its list.
My thoughts so far: run it multiple times and each time add the previous words to the stop words list.
Since:
A) I'm not even sure whether, algorithmically/logically, it's the right thing to do.
B) I hope there's a built-in way to do it that I'm not aware of.
C) This is a large database, and it takes about 20 minutes to run the LDA each time (using the multi-core version).
Question: Is there a better way to do it?
Hope to get some help,
Thanks.
LDA provides, for each topic and each word, a probability that the topic generates that word. You can try assigning words to topics by just taking, over all topics, the max of the probability of generating the word. In other words, if topic A generates "monkey" with probability 0.01 and topic B generates "monkey" with probability 0.02, then you can assign the word "monkey" to topic B.
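A minimal sketch of that assignment rule, with hard-coded probabilities standing in for what you would read out of the trained model (in gensim, e.g. via get_topics() or show_topic()):

```python
# Illustrative topic-word probabilities; with gensim you would build this
# from ldamodel.get_topics() rather than hard-coding it.
topic_word_probs = {
    "A": {"monkey": 0.01, "banana": 0.04},
    "B": {"monkey": 0.02, "banana": 0.01},
}

def assign_word(word, probs):
    """Assign a word to the topic that generates it with the highest probability."""
    return max(probs, key=lambda t: probs[t].get(word, 0.0))

print(assign_word("monkey", topic_word_probs))  # topic B: 0.02 > 0.01
print(assign_word("banana", topic_word_probs))  # topic A: 0.04 > 0.01
```

This gives you a disjoint word-to-topic mapping without rerunning the (20-minute) training at all.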
I think what you want to do is logically incorrect. Take, for example, a word like "bank", which has two different meanings ("river bank" or "money bank") depending on the context. When you intentionally remove the word from one topic's words, it's probable that you lose that topic's meaning (especially when the probability of that word is high). Take a look at this:
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/topic_methods.ipynb
I think the only remaining option (if it's even rational to do) is to use the probabilities of words in topics.
I'm trying to write a program to evaluate semantic similarity between texts. I have already compared n-gram frequencies between texts (a lexical measure). I wanted something a bit less shallow than this, and I figured that looking at similarity in sentence construction would be one way to evaluate text similarity.
However, all I can figure out how to do is count the POS tags (for example, 4 nouns per text, 2 verbs, etc.). This is then similar to just counting n-grams (and actually works less well than the n-grams).
postags = nltk.pos_tag(tokens)
self.pos_freq_dist = Counter(tag for word, tag in postags)
for pos, freq in self.pos_freq_dist.items():  # use iteritems() on Python 2
    # normalise POS frequency by token count
    self.pos_freq_dist_relative[pos] = freq / self.token_count
Lots of people (Pearson, ETS Research, IBM, academics, etc.) use parts of speech for deeper measures, but no one says how they have done it. How can parts of speech be used for a 'deeper' measure of semantic text similarity?
A more sophisticated tagger is required, such as the one described at http://phpir.com/part-of-speech-tagging/.
You will need to write algorithms and create word banks to determine the meaning or intention of sentences. Semantic analysis is artificial intelligence.
Nouns and capitalized nouns will be the subjects of the content. Adjectives will give some hint as to the polarity of the content. Vagueness, clarity, power, weakness, the types of words used. The possibilities are endless.
Take a look at chapter 6 of the NLTK Book. It should give you plenty of ideas for features you can use to classify text.
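One concrete way to go deeper than raw POS counts, in the spirit of the answers above, is to compare distributions of POS bigrams (tag sequences) instead of single tags, so some sentence structure survives. A sketch, with hard-coded tag sequences standing in for nltk.pos_tag output:

```python
from collections import Counter
from math import sqrt

def pos_bigram_vector(tags):
    """Relative frequencies of consecutive POS-tag pairs."""
    pairs = Counter(zip(tags, tags[1:]))
    total = sum(pairs.values())
    return {p: c / total for p, c in pairs.items()}

def cosine(u, v):
    """Cosine similarity of two sparse dict vectors."""
    dot = sum(u.get(k, 0.0) * v.get(k, 0.0) for k in set(u) | set(v))
    nu = sqrt(sum(x * x for x in u.values()))
    nv = sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv)

# Tag sequences as nltk.pos_tag would produce them (hard-coded here).
t1 = ["DT", "NN", "VBZ", "DT", "JJ", "NN"]
t2 = ["NN", "VBZ", "JJ", "NN", "NN"]
print(round(cosine(pos_bigram_vector(t1), pos_bigram_vector(t2)), 3))
```

Texts with similar sentence construction share many tag bigrams and score near 1, which is a (still shallow, but structural) step beyond counting individual tags.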
How can you determine the semantic similarity between two texts in python using WordNet?
The obvious preprocessing would be removing stop words and stemming, but then what?
The only way I can think of would be to calculate the WordNet path distance between each word in the two texts. This is standard for unigrams. But these are large (400 word) texts, that are natural language documents, with words that are not in any particular order or structure (other than those imposed by English grammar). So, which words would you compare between texts? How would you do this in python?
One thing that you can do is:
Kill the stop words
Find as many words as possible that have maximal intersections of synonyms and antonyms with those of other words in the same doc. Let's call these "the important words".
Check to see if the set of the important words of each document is the same. The closer they are together, the more semantically similar your documents.
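One common way to operationalise the per-word comparison is: for each word in one text, take its best similarity against every word of the other text, then average. Below is a sketch of that alignment step; the toy SIM lookup stands in for a real WordNet path-similarity call (with nltk, wordnet.path_similarity on synsets):

```python
# Toy pairwise similarities standing in for WordNet path similarity;
# in practice you would compute these from synsets via nltk's wordnet module.
SIM = {
    ("dog", "cat"): 0.2, ("dog", "hound"): 0.9,
    ("pet", "cat"): 0.6, ("pet", "hound"): 0.3,
}

def sim(a, b):
    """Symmetric lookup with 0.0 for unrelated pairs."""
    return SIM.get((a, b), SIM.get((b, a), 0.0))

def text_similarity(words1, words2):
    """Average, over words in text 1, of each word's best match in text 2."""
    best = [max(sim(w1, w2) for w2 in words2) for w1 in words1]
    return sum(best) / len(best)

print(text_similarity(["dog", "pet"], ["cat", "hound"]))
```

For 400-word texts this is quadratic in the number of words, so killing stop words first (step 1 above) also keeps the pairwise comparison tractable.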
There is another way. Compute sentence trees out of the sentences in each doc. Then compare the two forests. I did some similar work for a course a long time ago. Here's the code (keep in mind this was a long time ago and it was for class. So the code is extremely hacky, to say the least).
Hope this helps