I have a use case where I want to match a list of words against a list of sentences and return the most relevant sentences.
I am working in Python. What I have already tried is KMeans, where we cluster our set of documents and then predict which cluster a new sentence belongs to. But in my case I already have the list of words available.
def getMostRelevantSentences():
    Sentences = ["This is the most beautiful place in the world.",
                 "This man has more skills to show in cricket than any other game.",
                 "Hi there! how was your ladakh trip last month?",
                 "Isn’t cricket supposed to be a team sport? I feel people should decide first whether cricket is a team game or an individual sport."]
    words = ["cricket","sports","team","play","match"]
    #TODO: now this should return me the 2nd and last item from the Sentences list as the words list mostly matches with them
So from the above code I want to return the sentences that closely match the words provided. I don't want to use supervised machine learning here. Any help will be appreciated.
So finally I have used this super library called gensim to compute the similarity scores.
import gensim
from nltk.tokenize import word_tokenize
def getSimilarityScore(raw_documents, words):
    gen_docs = [[w.lower() for w in word_tokenize(text)]
                for text in raw_documents]
    dictionary = gensim.corpora.Dictionary(gen_docs)
    corpus = [dictionary.doc2bow(gen_doc) for gen_doc in gen_docs]
    tf_idf = gensim.models.TfidfModel(corpus)
    sims = gensim.similarities.Similarity('/usr/workdir', tf_idf[corpus],
                                          num_features=len(dictionary))
    query_doc_bow = dictionary.doc2bow(words)
    query_doc_tf_idf = tf_idf[query_doc_bow]
    return sims[query_doc_tf_idf]
You can use this method as:
Sentences = ["This is the most beautiful place in the world.",
             "This man has more skills to show in cricket than any other game.",
             "Hi there! how was your ladakh trip last month?",
             "Isn’t cricket supposed to be a team sport? I feel people should decide first whether cricket is a team game or an individual sport."]
words = ["cricket","sports","team","play","match"]
words_lower = [w.lower() for w in words]
getSimilarityScore(Sentences,words_lower)
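The call returns one similarity score per sentence, in the same order as the input list, so the most relevant sentences can be picked by sorting the scores. A minimal sketch of that step (the use of numpy.argsort and the zero threshold are my own additions, not part of the answer above):

import numpy as np

scores = getSimilarityScore(Sentences, words_lower)
ranked = np.argsort(scores)[::-1]  # indices from most to least similar
for idx in ranked:
    if scores[idx] > 0:  # keep only sentences with some overlap with the query words
        print(scores[idx], Sentences[idx])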
I'm using nltk via the following code to extract nouns from a sentence:
words = nltk.word_tokenize(sentence)
tags = nltk.pos_tag(words)
And then I choose the words tagged with the NN and NNP Part of Speech (PoS) tags. However, it only extracts single nouns like "book" and "table", and ignores noun pairs like "basketball shoe". What should I do to expand the results to contain such compound noun pairs?
Assuming you just want to find noun-noun compounds (e.g. "book store") and not other combinations like noun-verb (e.g. "snow fall") or adj-noun (e.g. "hot dog"), the following solution will capture 2 or more consecutive occurrences of either the NN, NNS, NNP or NNPS Part of Speech (PoS) tags.
Example
Using the NLTK RegExpParser with the custom grammar rule defined in the solution below, three compound nouns ("basketball shoe", "book store" and "peanut butter") are extracted from the following sentence:
John lost his basketball shoe in the book store while eating peanut butter
Solution
from nltk import word_tokenize, pos_tag, RegexpParser
text = "John lost his basketball shoe in the book store while eating peanut butter"
tokenized = word_tokenize(text) # Tokenize text
tagged = pos_tag(tokenized) # Tag tokenized text with PoS tags
# Create custom grammar rule to find consecutive occurrences of nouns
my_grammar = r"""
CONSECUTIVE_NOUNS: {<N.*><N.*>+}"""
# Function to create parse tree using custom grammar rules and PoS tagged text
def get_parse_tree(grammar, pos_tagged_text):
    cp = RegexpParser(grammar)
    parse_tree = cp.parse(pos_tagged_text)
    # parse_tree.draw() # Visualise parse tree
    return parse_tree

# Function to get labels from custom grammar:
# takes line separated NLTK regexp grammar rules
def get_labels_from_grammar(grammar):
    labels = []
    for line in grammar.splitlines()[1:]:
        labels.append(line.split(":")[0])
    return labels

# Function takes parse tree & list of NLTK custom grammar labels as input
# Returns phrases which match
def get_phrases_using_custom_labels(parse_tree, custom_labels_to_get):
    matching_phrases = []
    for node in parse_tree.subtrees(filter=lambda x: any(x.label() == custom_l for custom_l in custom_labels_to_get)):
        # Get phrases only, drop PoS tags
        matching_phrases.append([leaf[0] for leaf in node.leaves()])
    return matching_phrases

text_parse_tree = get_parse_tree(my_grammar, tagged)
my_labels = get_labels_from_grammar(my_grammar)
phrases = get_phrases_using_custom_labels(text_parse_tree, my_labels)
for phrase in phrases:
    print(phrase)
Output
['basketball', 'shoe']
['book', 'store']
['peanut', 'butter']
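Each match comes back as a list of tokens; if plain strings are more convenient, joining them is enough (a small addition, not part of the answer above):

compound_nouns = [" ".join(phrase) for phrase in phrases]
print(compound_nouns)  # ['basketball shoe', 'book store', 'peanut butter']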
As a beginner in text mining, I am trying to replicate the analyses from this paper. Essentially, the authors extract LDA topics (1-4) from a document and then "the topics extracted by LDA have been converted to suitable TF-IDF matrices that have been then used to predict..." (it is not important what they predict; it is a bunch of regressions). Their definition of TF and IDF (section 4.2.5) seems standard; however, my understanding is that TF-IDF measures apply to a word in a document, not to topics. Given that they have a case where they extract a single topic, I think it is impossible to use the probability of the topic in a document, as this will always be 1 (though correct me if I am wrong).
So, what are the possible interpretations of converting LDA topics to "suitable TF-IDF" matrices? (And how would one go about doing that using the code below?)
Would that mean converting each and every word in a document to its TF-IDF weight and then using those weights for prediction? That does not seem plausible: with 1000+ documents the number of features would be very high, and almost certainly most of them would be useless.
Minimal reproducible example
(credit: Jordan Barber)
from nltk.tokenize import RegexpTokenizer
from stop_words import get_stop_words
from nltk.stem.porter import PorterStemmer
from gensim import corpora, models
import gensim
tokenizer = RegexpTokenizer(r'\w+')
# create English stop words list
en_stop = get_stop_words('en')
# Create p_stemmer of class PorterStemmer
p_stemmer = PorterStemmer()
# create sample documents
doc_a = "Brocolli is good to eat. My brother likes to eat good brocolli, but not my mother."
doc_b = "My mother spends a lot of time driving my brother around to baseball practice."
doc_c = "Some health experts suggest that driving may cause increased tension and blood pressure."
doc_d = "I often feel pressure to perform well at school, but my mother never seems to drive my brother to do better."
doc_e = "Health professionals say that brocolli is good for your health."
# compile sample documents into a list
doc_set = [doc_a, doc_b, doc_c, doc_d, doc_e]
# list for tokenized documents in loop
texts = []
# loop through document list
for i in doc_set:
    # clean and tokenize document string
    raw = i.lower()
    tokens = tokenizer.tokenize(raw)
    # remove stop words from tokens
    stopped_tokens = [i for i in tokens if i not in en_stop]
    # stem tokens
    stemmed_tokens = [p_stemmer.stem(i) for i in stopped_tokens]
    # add tokens to list
    texts.append(stemmed_tokens)
# turn our tokenized documents into a id <-> term dictionary
dictionary = corpora.Dictionary(texts)
# convert tokenized documents into a document-term matrix
corpus = [dictionary.doc2bow(text) for text in texts]
# generate LDA model
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=2, id2word = dictionary, passes=20)
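For reference on the "probability of the topic in a document" discussed above, gensim exposes the per-document topic distribution directly from the fitted model; a minimal sketch using the ldamodel and corpus built above (the minimum_probability value is my own choice):

# print the topic probabilities for each document in the corpus
for i, bow in enumerate(corpus):
    print(i, ldamodel.get_document_topics(bow, minimum_probability=0.0))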
This is the code where I define a noun phrase chunking method:
def np_chunking(sentence):
    import nltk
    from nltk import word_tokenize, pos_tag, ne_chunk
    from nltk import Tree
    grammar = "NP: {<JJ>*<NN.*>+}\n{<NN.*>+}"  # chunker rules: adjective(s) plus noun(s), or one or more nouns
    sen = sentence
    cp = nltk.RegexpParser(grammar)
    mychunk = cp.parse(pos_tag(word_tokenize(sen)))
    result = mychunk
    # result.draw()  # uncomment to visualise the parse tree
    return result
It works like this
print(np_chunking("""I like to listen to music from musical genres,such as blues,rock and jazz."""))
But when I change the text into another sentence like
print(np_chunking("""He likes to play basketball,football and other sports."""))
I do want to extract noun phrase chunks with a structure like adjective plus noun, or multiple nouns. But in the second example, the word "other" is part of the structure 'np_1, np_2 and other np_3'; after 'and other', a hypernym usually follows.
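The reason the chunker swallows "other" can be seen by inspecting the PoS tags directly. A small check of my own (not part of the original code), assuming the default NLTK tagger, which I would expect to tag "other" as JJ so that the <JJ>* part of the NP rule absorbs it into the chunk:

from nltk import word_tokenize, pos_tag

print(pos_tag(word_tokenize("He likes to play basketball,football and other sports.")))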
In the second part
import re

def hyponym_extract(prepared_text, hearst_patterns):
    text = merge_NP(prepared_text)
    hyponyms = []
    result = []
    if re.search(hearst_patterns[0][0], text) is not None:
        matches = re.search(hearst_patterns[0][0], text)
        NP_match = re.findall(r"NP_\w+", matches.group(0))
        hyponyms = NP_match[1:]
        result = [(NP_match[0], x) for x in hyponyms]
    if re.search(hearst_patterns[1][0], text) is not None:
        matches = re.search(hearst_patterns[1][0], text)
        NP_match = re.findall(r"NP_\w+", matches.group(0))
        hyponyms = NP_match[:-1]
        result = [(NP_match[-1], x) for x in hyponyms]
    return result

hearst_patterns = [(r"(NP_\w+ (, )?such as (NP_\w+ ?(, )?(and |or )?)+)", "first"),
                   (r"((NP_\w+ ?(, )?)+(and |or )?other NP_\w+)", "last")]  # two examples of Hearst patterns
print(hyponym_extract(prepare_chunks(np_chunking("I like to listen to music from musical genres,such as blues,rock and jazz.")),hearst_patterns))
print(hyponym_extract(prepare_chunks(np_chunking("He likes to play basketball,football and other sports.")),hearst_patterns))
The word "other" is part of the Hearst pattern used to extract hypernyms and hyponyms.
So how could I improve my first piece of code so that the second one works correctly?
I have been generating topics from the Yelp dataset of customer reviews by using Latent Dirichlet Allocation (LDA) in Python (gensim package). While generating tokens, I am selecting only words having length >= 3 from the reviews (by using RegexpTokenizer):
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w{3,}')
tokens = tokenizer.tokenize(review)
This allows us to filter out noisy words of length less than 3 while creating the corpus.
How will filtering out these words affect the performance of the LDA algorithm?
Generally speaking, for the English language, one- and two-letter words don't add information about the topic. If they don't add value, they should be removed during the pre-processing step. As with most algorithms, less data in will speed up the execution time.
Words shorter than three letters are generally treated as stop words. LDA builds topics, so imagine you generate this topic:
[I, him, her, they, we, and, or, to]
compared to:
[shark, bull, greatwhite, hammerhead, whaleshark]
Which is more telling? This is why it is important to remove stopwords. This is how I do that:
import gensim
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Create functions to lemmatize, stem, and preprocess
# turn beautiful, beautifully, beautified into the stem beauti
def lemmatize_stemming(text):
    stemmer = PorterStemmer()
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

# parse docs into individual words, ignoring words of 3 letters or fewer
# and stopwords: him, her, them, for, there, etc., since "their" is not a topic.
# then append the tokens to a list
def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        newStopWords = ['your_stopword1', 'your_stopword2']
        if token not in gensim.parsing.preprocessing.STOPWORDS and token not in newStopWords and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result
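A quick usage sketch (the review text is made up for illustration; the exact stems in the output depend on the Porter stemmer):

sample_review = "The waiters were friendly and the pizzas tasted amazing"
print(preprocess(sample_review))  # stopwords and short tokens are dropped, the rest are lemmatized and stemmed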
I was testing the NLTK package's vocabulary. I used the following code and was hoping to see all True.
import nltk
english_vocab = set(w.lower() for w in nltk.corpus.words.words())
print ('answered' in english_vocab)
print ('unanswered' in english_vocab)
print ('altered' in english_vocab)
print ('alter' in english_vocab)
print ('looks' in english_vocab)
print ('look' in english_vocab)
But my results are as follows; many words are missing, or rather some forms of the words are missing. Am I missing something?
False
True
False
True
False
True
Indeed, the corpus is not an exhaustive list of all English words, but rather a collection of texts. A more appropriate way of telling whether a word is a valid English word is to use WordNet:
from nltk.corpus import wordnet as wn
print(wn.synsets('answered'))
# [Synset('answer.v.01'), Synset('answer.v.02'), Synset('answer.v.03'), Synset('answer.v.04'), Synset('answer.v.05'), Synset('answer.v.06'), Synset('suffice.v.01'), Synset('answer.v.08'), Synset('answer.v.09'), Synset('answer.v.10')]
print(wn.synsets('unanswered'))
# [Synset('unanswered.s.01')]
print(wn.synsets('notaword'))
# []
NLTK corpora do not actually store every word; a corpus is defined as "a large body of text".
For example, you were using the words corpus, and we can check its definition by using its readme() method:
>>> print(nltk.corpus.words.readme())
Wordlists
en: English, http://en.wikipedia.org/wiki/Words_(Unix)
en-basic: 850 English words: C.K. Ogden in The ABC of Basic English (1932)
Unix's words is not exhaustive, so it may indeed be missing some words. Corpora are, by their nature, incomplete (hence the emphasis on natural language).
That being said, you might want to try a larger corpus of edited text, like brown:
>>> print(nltk.corpus.brown.readme())
BROWN CORPUS
A Standard Corpus of Present-Day Edited American English, for use with Digital Computers.
by W. N. Francis and H. Kucera (1964)
Department of Linguistics, Brown University
Providence, Rhode Island, USA
Revised 1971, Revised and Amplified 1979
http://www.hit.uib.no/icame/brown/bcm.html
Distributed with the permission of the copyright holder, redistribution permitted.
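Following that suggestion, a membership check against brown looks just like the original code (a sketch; whether a particular inflected form appears still depends on the corpus):

import nltk

brown_vocab = set(w.lower() for w in nltk.corpus.brown.words())
print('answered' in brown_vocab)
print('looks' in brown_vocab)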