Calculate tf-idf in Gensim for my vocabulary - python

I have a set of words (n-grams) for which I need to calculate tf-idf values. These words are:
myvocabulary = ['tim tam', 'jam', 'fresh milk', 'chocolates', 'biscuit pudding']
My corpus looks as follows.
corpus = {1: "making chocolates biscuit pudding easy first get your favourite biscuit chocolates", 2: "tim tam drink new recipe that yummy and tasty more thicker than typical milkshake that uses normal chocolates", 3: "making chocolates drink different way using fresh milk egg"}
I am currently getting tf-idf values for my n-grams in myvocabulary using sklearn as follows.
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(vocabulary = myvocabulary, ngram_range = (1,3))
tfs = tfidf.fit_transform(corpus.values())
However, I am interested in doing the same in Gensim. All the examples I came across in Gensim:
use only unigrams (I want it for bigrams and trigrams as well)
calculate values for all the words (I only want to calculate them for the words in myvocabulary)
Hence, please help me find out how to do these two things in Gensim.

In gensim, for a dictionary, you should use the gensim.corpora.Dictionary class; look at the examples.
Unfortunately, we have no support for n-grams in general, only bigrams for words via the Phrases class.
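Since Gensim itself only covers unigrams (plus word-pair bigrams via Phrases), one workaround is to pre-tokenize each document into just the phrases from myvocabulary and then run TfidfModel on that restricted corpus. Below is a minimal sketch of that idea; the helper extract_vocab_terms is illustrative and not part of Gensim:
from gensim import corpora, models

myvocabulary = ['tim tam', 'jam', 'fresh milk', 'chocolates', 'biscuit pudding']
corpus = {1: "making chocolates biscuit pudding easy first get your favourite biscuit chocolates",
          2: "tim tam drink new recipe that yummy and tasty more thicker than typical milkshake that uses normal chocolates",
          3: "making chocolates drink different way using fresh milk egg"}

def extract_vocab_terms(text, vocab):
    # collect every occurrence of a vocabulary phrase (uni-, bi- or tri-gram) in the text
    words = text.split()
    found = []
    for n in (1, 2, 3):
        for i in range(len(words) - n + 1):
            candidate = ' '.join(words[i:i + n])
            if candidate in vocab:
                found.append(candidate)
    return found

texts = [extract_vocab_terms(doc, set(myvocabulary)) for doc in corpus.values()]
dictionary = corpora.Dictionary(texts)                 # id <-> term mapping over vocabulary terms only
bow_corpus = [dictionary.doc2bow(text) for text in texts]
tfidf = models.TfidfModel(bow_corpus)                  # IDF statistics fitted on the restricted corpus

for doc in tfidf[bow_corpus]:
    print([(dictionary[term_id], round(weight, 3)) for term_id, weight in doc])
Note that the values will not match sklearn exactly: Gensim's default IDF assigns zero weight to a term that occurs in every document (here 'chocolates'), whereas TfidfVectorizer uses a smoothed IDF.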

Related

Meaning behind converting LDA topics to "suitable" TF-IDF matrices

As a beginner in text mining, I am trying to replicate the analyses from this paper. Essentially, the authors extract LDA topics (1-4) from a document and then "the topics extracted by LDA have been converted to suitable TF-IDF matrices that have been then used to predict..." (not important what they predict, it's a bunch of regressions). Their definition of TF and IDF (section 4.2.5) seems common, though, my understanding is that the TF-IDF measures apply to a word in a document, not topics. Given that they have a case where they extract a single topic, I think it's impossible to use the probability of the topic in a document, as this will always be 1 (though correct me if I am wrong).
So, what are the possible interpretations of converting LDA topics to "suitable TF-IDF" matrices? (and how would one go about doing that using the below code?)
Would that mean converting each and every word in a document to its TF-IDF weight and then using those weights in prediction? That does not seem plausible: with 1000+ documents the resulting matrix would be very high-dimensional, and almost certainly most of the features would be useless.
Minimally reproducible example
(credit: Jordan Barber)
from nltk.tokenize import RegexpTokenizer
from stop_words import get_stop_words
from nltk.stem.porter import PorterStemmer
from gensim import corpora, models
import gensim
tokenizer = RegexpTokenizer(r'\w+')
# create English stop words list
en_stop = get_stop_words('en')
# Create p_stemmer of class PorterStemmer
p_stemmer = PorterStemmer()
# create sample documents
doc_a = "Brocolli is good to eat. My brother likes to eat good brocolli, but not my mother."
doc_b = "My mother spends a lot of time driving my brother around to baseball practice."
doc_c = "Some health experts suggest that driving may cause increased tension and blood pressure."
doc_d = "I often feel pressure to perform well at school, but my mother never seems to drive my brother to do better."
doc_e = "Health professionals say that brocolli is good for your health."
# compile sample documents into a list
doc_set = [doc_a, doc_b, doc_c, doc_d, doc_e]
# list for tokenized documents in loop
texts = []
# loop through document list
for i in doc_set:
    # clean and tokenize document string
    raw = i.lower()
    tokens = tokenizer.tokenize(raw)
    # remove stop words from tokens
    stopped_tokens = [i for i in tokens if not i in en_stop]
    # stem tokens
    stemmed_tokens = [p_stemmer.stem(i) for i in stopped_tokens]
    # add tokens to list
    texts.append(stemmed_tokens)
# turn our tokenized documents into a id <-> term dictionary
dictionary = corpora.Dictionary(texts)
# convert tokenized documents into a document-term matrix
corpus = [dictionary.doc2bow(text) for text in texts]
# generate LDA model
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=2, id2word = dictionary, passes=20)
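For concreteness, here are two possible readings of "suitable TF-IDF matrices", continuing from the code above; both are assumptions on my part rather than the paper's stated method. The first builds a document x topic matrix of per-document topic proportions; the second builds a document x term matrix of TF-IDF weights restricted to each topic's top words:
import numpy as np

# (a) document x topic matrix: per-document topic proportions from the fitted LDA model
topic_features = np.zeros((len(corpus), ldamodel.num_topics))
for doc_idx, bow in enumerate(corpus):
    for topic_id, prob in ldamodel.get_document_topics(bow, minimum_probability=0.0):
        topic_features[doc_idx, topic_id] = prob

# (b) document x term matrix: TF-IDF weights of the top words of each topic
tfidf = models.TfidfModel(corpus)
weights = [dict(tfidf[bow]) for bow in corpus]
top_word_ids = sorted({word_id
                       for topic_id in range(ldamodel.num_topics)
                       for word_id, _ in ldamodel.get_topic_terms(topic_id, topn=10)})
term_features = np.array([[doc.get(word_id, 0.0) for word_id in top_word_ids]
                          for doc in weights])
Either matrix could then be used as features in the regressions the paper describes.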

Matching set of words with set of sentences in python nlp

I have a use case where I want to match a list of words against a list of sentences and return the most relevant sentences.
I am working in Python. What I have already tried is KMeans, where we cluster our set of documents and then predict which cluster a new sentence falls into. But in my case I already have the list of words available.
def getMostRelevantSentences():
    Sentences = ["This is the most beautiful place in the world.",
                 "This man has more skills to show in cricket than any other game.",
                 "Hi there! how was your ladakh trip last month?",
                 "Isn’t cricket supposed to be a team sport? I feel people should decide first whether cricket is a team game or an individual sport."]
    words = ["cricket","sports","team","play","match"]
    #TODO: now this should return me the 2nd and last item from the Sentences list as the words list mostly matches with them
So from the above code I want to return the sentences which most closely match the words provided. I don't want to use supervised machine learning here. Any help will be appreciated.
So finally I used this super library called gensim to generate the similarity scores.
import gensim
from nltk.tokenize import word_tokenize
def getSimilarityScore(raw_documents, words):
    gen_docs = [[w.lower() for w in word_tokenize(text)]
                for text in raw_documents]
    dictionary = gensim.corpora.Dictionary(gen_docs)
    corpus = [dictionary.doc2bow(gen_doc) for gen_doc in gen_docs]
    tf_idf = gensim.models.TfidfModel(corpus)
    sims = gensim.similarities.Similarity('/usr/workdir', tf_idf[corpus],
                                          num_features=len(dictionary))
    query_doc_bow = dictionary.doc2bow(words)
    query_doc_tf_idf = tf_idf[query_doc_bow]
    return sims[query_doc_tf_idf]
You can use this method as:
Sentences = ["This is the most beautiful place in the world.",
"This man has more skills to show in cricket than any other game.",
"Hi there! how was your ladakh trip last month?",
"Isn’t cricket supposed to be a team sport? I feel people should decide first whether cricket is a team game or an individual sport."]
words = ["cricket","sports","team","play","match"]
words_lower = [w.lower() for w in words]
getSimilarityScore(Sentences,words_lower)
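To go from raw similarity scores back to the most relevant sentences, one can rank the scores and keep the top entries. A small follow-up sketch; top_k and the zero threshold are arbitrary choices, not part of the original answer:
import numpy as np

scores = getSimilarityScore(Sentences, words_lower)

top_k = 2
ranked = np.argsort(scores)[::-1][:top_k]          # indices of the highest-scoring sentences
most_relevant = [Sentences[i] for i in ranked if scores[i] > 0]
print(most_relevant)  # expected: the cricket-related sentences (2nd and last)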

RegEx in vocabulary not working in sklearn TfidfVectorizer

I'm trying to calculate tf-idf for selected words in a corpus, but it doesn't work when I use a regex in the selected words.
Below is an example I copied from another question on Stack Overflow and changed slightly to reflect my question.
The code is pasted below. It works if I write "chocolate" and "chocolates" as separate entries, but doesn't work if I write 'chocolate|chocolates'.
Can someone help me understand why and suggest possible solutions to this problem?
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

keywords = ['tim tam', 'jam', 'fresh milk', 'chocolate|chocolates', 'biscuit pudding']
corpus = {1: "making chocolate biscuit pudding easy first get your favourite biscuit chocolates", 2: "tim tam drink new recipe that yummy and tasty more thicker than typical milkshake that uses normal chocolates", 3: "making chocolates drink different way using fresh milk egg"}
tfidf = TfidfVectorizer(vocabulary = keywords, stop_words = 'english', ngram_range=(1,3))
tfs = tfidf.fit_transform(corpus.values())
feature_names = tfidf.get_feature_names()
corpus_index = [n for n in corpus]
rows, cols = tfs.nonzero()
for row, col in zip(rows, cols):
    print((feature_names[col], corpus_index[row]), tfs[row, col])
tfidf_results = pd.DataFrame(tfs.T.todense(), index=feature_names, columns=corpus_index).T
I expect the results to be:
('biscuit pudding', 1) 0.652490884512534
('chocolates', 1) 0.3853716274664007
('chocolate', 1) 0.652490884512534
('chocolates', 2) 0.5085423203783267
('tim tam', 2) 0.8610369959439764
('chocolates', 3) 0.5085423203783267
('fresh milk', 3) 0.8610369959439764
But, now it returns:
('biscuit pudding', 1) 1.0
('tim tam', 2) 1.0
('fresh milk', 3) 1.0
I'm going to guess you are using TfidfVectorizer from scikit-learn. If you read the documentation carefully, nowhere does it say that you can use regexes in your vocabulary; can you point to the question you say you copied from?
If you want to group multiple terms together manually, you can specify a mapping instead of an iterable as your vocabulary. For example:
keywords = {'tim tam':0, 'jam':1, 'fresh milk':2, 'chocolate':3, 'chocolates':3, 'biscuit pudding':4}
Notice how both chocolate and chocolates map to the same index.

How to un-stem a word in Python?

I want to know if there is any way that I can un-stem them to a normal form?
The problem is that I have thousands of words in different forms, e.g. eat, eaten, ate, eating and so on, and I need to count the frequency of each word. All of these - eat, eaten, ate, eating, etc. - will count towards eat, and hence I used stemming.
But the next part of the problem requires me to find similar words in the data, and I am using nltk's synsets to calculate Wu-Palmer Similarity among the words. The problem is that nltk's synsets won't work on stemmed words, or at least they won't in this code: check if two words are related to each other
How should I do it? Is there a way to un-stem a word?
I think an OK approach is something like the one described in https://stackoverflow.com/a/30670993/7127519.
A possible implementation could be something like this:
import re
import string
import nltk
import pandas as pd
stemmer = nltk.stem.porter.PorterStemmer()
A stemmer to use, and here a text to work with:
complete_text = ''' cats catlike catty cat
stemmer stemming stemmed stem
fishing fished fisher fish
argue argued argues arguing argus argu
argument arguments argument '''
Create a list with the different words:
my_list = []
#for i in complete_text.decode().split():
try:
    aux = complete_text.decode().split()
except:
    aux = complete_text.split()
for i in aux:
    if i not in my_list:
        my_list.append(i.lower())
my_list
with output:
['cats',
'catlike',
'catty',
'cat',
'stemmer',
'stemming',
'stemmed',
'stem',
'fishing',
'fished',
'fisher',
'fish',
'argue',
'argued',
'argues',
'arguing',
'argus',
'argu',
'argument',
'arguments']
And now create the dictionary:
aux = pd.DataFrame(my_list, columns =['word'] )
aux['word_stemmed'] = aux['word'].apply(lambda x : stemmer.stem(x))
aux = aux.groupby('word_stemmed').transform(lambda x: ', '.join(x))
aux['word_stemmed'] = aux['word'].apply(lambda x : stemmer.stem(x.split(',')[0]))
aux.index = aux['word_stemmed']
del aux['word_stemmed']
my_dict = aux.to_dict('dict')['word']
my_dict
Its output is:
{'argu': 'argue, argued, argues, arguing, argus, argu',
'argument': 'argument, arguments',
'cat': 'cats, cat',
'catlik': 'catlike',
'catti': 'catty',
'fish': 'fishing, fished, fish',
'fisher': 'fisher',
'stem': 'stemming, stemmed, stem',
'stemmer': 'stemmer'}
Companion notebook here.
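Given a token stemmed with the same PorterStemmer, you can then look up its original forms directly:
print(my_dict['fish'])      # -> 'fishing, fished, fish'
print(my_dict['argument'])  # -> 'argument, arguments'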
No, there isn't. With stemming, you lose information, not only about the word form (as in eat vs. eats or eaten), but also about the word itself (as in tradition vs. traditional). Unless you're going to use a prediction method to try and predict this information on the basis of the context of the word, there's no way to get it back.
tl;dr: you could use any stemmer you want (e.g.: Snowball) and keep track of what word was the most popular before stemming for each stemmed word by counting occurrences.
You may like this open-source project which uses Stemming and contains an algorithm to do Inverse Stemming:
https://github.com/ArtificiAI/Multilingual-Latent-Dirichlet-Allocation-LDA
On this page of the project, there are explanations of how to do the Inverse Stemming. To sum things up, it works as follows.
First, you will stem some documents, here short (French language) strings with their stop words removed for example:
['sup chat march trottoir',
'sup chat aiment ronron',
'chat ronron',
'sup chien aboi',
'deux sup chien',
'combien chien train aboi']
Then the trick is to have kept, for each stemmed word, counts of the original words it came from:
{'aboi': {'aboie': 1, 'aboyer': 1},
'aiment': {'aiment': 1},
'chat': {'chat': 1, 'chats': 2},
'chien': {'chien': 1, 'chiens': 2},
'combien': {'Combien': 1},
'deux': {'Deux': 1},
'march': {'marche': 1},
'ronron': {'ronronner': 1, 'ronrons': 1},
'sup': {'super': 4},
'train': {'train': 1},
'trottoir': {'trottoir': 1}}
Finally, you can now guess how to implement this yourself: simply take, for a given stemmed word, the original word with the highest count. You can refer to the following implementation, which is available under the MIT License as part of the Multilingual-Latent-Dirichlet-Allocation-LDA project:
https://github.com/ArtificiAI/Multilingual-Latent-Dirichlet-Allocation-LDA/blob/master/lda_service/logic/stemmer.py
Improvements could be made by ditching the non-top reverse words (by using a heap for example) which would yield just one dict in the end instead of a dict of dicts.
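As an illustration of the idea (not the project's actual code), here is a minimal sketch that keeps per-stem counts with collections.Counter while stemming and then inverts them by picking the most frequent original form:
from collections import Counter, defaultdict
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")
documents = ["the cats were eating", "a cat eats", "cats eat food"]

# keep a count of every original form seen for each stem
stem_counts = defaultdict(Counter)
for doc in documents:
    for word in doc.split():
        stem_counts[stemmer.stem(word)][word] += 1

# "un-stem" by taking the most frequent original form for each stem
unstem = {stem: counts.most_common(1)[0][0] for stem, counts in stem_counts.items()}
print(unstem['cat'])  # -> 'cats' (the most frequent original form in this toy corpus)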
I suspect what you really mean by stem is "tense": you want the different tenses of each word to count towards the "base form" of the verb.
check out the pattern package
pip install pattern
Then use the en.lemma function to return a verb's base form.
import pattern.en as en
base_form = en.lemma('ate') # base_form == "eat"
Theoretically, the only way to un-stem is if, prior to stemming, you kept a dictionary of terms or a mapping of some kind and carried this mapping through the rest of your computations. This mapping should capture the position of each unstemmed token, so that when you need to un-stem a token and you know the original position of the stemmed token, you can trace back and recover the original unstemmed representation from your mapping.
For the Bag of Words representation this seems computationally intensive and somewhat defeats the purpose of the statistical nature of the BoW approach.
But again, theoretically I believe it could work. I haven't seen it in any implementation, though.

Is there any classifiier which works at both word and sentence level?

In scikit-learn or NLTK, classifiers generally consider term frequency or TF-IDF.
I want to consider term frequency as well as sentence structure for classification. I have 15 categories of questions, each with a text file containing sentences on new lines.
The category city contains this sentence:
In which city Obama was born?
If I rely only on term frequency, then the following might not be matched, because "Obama" or "city" in the dataset do not match the query sentence:
1. In which place Hally was born
2. In which city Hally was born?
Is there any classifier that considers both term frequency and sentence structure, so that, once trained, it also classifies input queries with a similar sentence structure?
You could train the tf-idf on ngrams as well, in addition to the unigrams.
In scikit-learn you can specify the ngram_range that will be taken into account: if you set it to train on up to 3-grams, you end up storing the frequency of word combinations such as "In which place", which is pretty indicative of the type of question being asked.
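A minimal sketch of that suggestion, assuming a scikit-learn pipeline; the tiny training set and the LogisticRegression choice are illustrative assumptions, not part of the original answer:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# tiny illustrative training set: two question categories
train_sentences = ["In which city Obama was born?",
                   "In which city is the Eiffel Tower?",
                   "When was Obama born?",
                   "When did the war end?"]
train_labels = ["city", "city", "date", "date"]

# TF-IDF over unigrams up to trigrams captures structural cues like "in which city"
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 3)),
                    LogisticRegression())
clf.fit(train_sentences, train_labels)

print(clf.predict(["In which place Hally was born?"]))  # the shared n-gram cues still fire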
As drekyn said, you can use scikit-learn for feature extraction; here are some examples:
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> bigram_vectorizer = CountVectorizer(ngram_range=(1, 2),
...                                     token_pattern=r'\b\w+\b', min_df=1)
>>> analyze = bigram_vectorizer.build_analyzer()
>>> analyze('Bi-grams are cool!') == (
...     ['bi', 'grams', 'are', 'cool', 'bi grams', 'grams are', 'are cool'])
True
Source
