How to use comments in a CSV table to get a text classification? - python

I have several statements indicating whether people like a product or not. I have anonymized the comments and used other names.
Since the classification problem is very similar to that of movie reviews, I relied entirely on the tutorial at http://www.nltk.org/book/ch06.html, Section 1.3.
import random
import nltk
from nltk.corpus import movie_reviews

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)

all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = list(all_words)[:2000]

def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features

featuresets = [(document_features(d), c) for (d, c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)
So far so good.
My concern is that my data are organized in the following way:
sex;statement;income_dollar_year
"m";"I REALLY like milk. It taste soo good and is healthy. Everyone should drink it";40000
"f";"Milk itself is tasty, but I don't like how cows are kept in huge stables";30000
"m";"I don't like milk, I have intolerance against lactose, so my stomach pains";35000
I just want labels saying that statements one and two are positive and the third one is negative. I also know that I have to remove all words that are not in the dataset, otherwise I would have zeros, which would get me into trouble.
As I have no idea how the data in movie_reviews are organized, I can't work out a solution.
My idea is to get a CSV like this:
sex;statement;income_dollar_year;class
"m";"I REALLY like milk. It taste soo good and is healthy. Everyone should drink it";40000;"pos"
"f";"Milk itself is tasty, but I don't like how cows are kept in huge stables";30000;"pos"
"m";"I don't like milk, I have intolerance against lactose, so my stomach pains";35000;"neg"

Related

Meaning behind converting LDA topics to "suitable" TF-IDF matrices

As a beginner in text mining, I am trying to replicate the analyses from this paper. Essentially, the authors extract LDA topics (1-4) from a document and then "the topics extracted by LDA have been converted to suitable TF-IDF matrices that have been then used to predict..." (it is not important what they predict, it's a bunch of regressions). Their definition of TF and IDF (section 4.2.5) seems standard; however, my understanding is that TF-IDF measures apply to a word in a document, not to topics. Given that they have a case where they extract a single topic, I think it's impossible to use the probability of the topic in a document, as this will always be 1 (though correct me if I am wrong).
So, what are the possible interpretations of converting LDA topics to "suitable TF-IDF" matrices? (And how would one go about doing that using the code below?)
Would that mean converting each and every word in a document to its TF-IDF weight and then using those weights in prediction? That does not seem plausible, as with 1000+ documents the resulting feature space would be very large and almost certainly mostly useless.
Minimal reproducible example
(credit: Jordan Barber)
from nltk.tokenize import RegexpTokenizer
from stop_words import get_stop_words
from nltk.stem.porter import PorterStemmer
from gensim import corpora, models
import gensim
tokenizer = RegexpTokenizer(r'\w+')
# create English stop words list
en_stop = get_stop_words('en')
# Create p_stemmer of class PorterStemmer
p_stemmer = PorterStemmer()
# create sample documents
doc_a = "Brocolli is good to eat. My brother likes to eat good brocolli, but not my mother."
doc_b = "My mother spends a lot of time driving my brother around to baseball practice."
doc_c = "Some health experts suggest that driving may cause increased tension and blood pressure."
doc_d = "I often feel pressure to perform well at school, but my mother never seems to drive my brother to do better."
doc_e = "Health professionals say that brocolli is good for your health."
# compile sample documents into a list
doc_set = [doc_a, doc_b, doc_c, doc_d, doc_e]
# list for tokenized documents in loop
texts = []
# loop through document list
for i in doc_set:
    # clean and tokenize document string
    raw = i.lower()
    tokens = tokenizer.tokenize(raw)
    # remove stop words from tokens
    stopped_tokens = [i for i in tokens if not i in en_stop]
    # stem tokens
    stemmed_tokens = [p_stemmer.stem(i) for i in stopped_tokens]
    # add tokens to list
    texts.append(stemmed_tokens)
# turn our tokenized documents into a id <-> term dictionary
dictionary = corpora.Dictionary(texts)
# convert tokenized documents into a document-term matrix
corpus = [dictionary.doc2bow(text) for text in texts]
# generate LDA model
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=2, id2word = dictionary, passes=20)
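One possible reading of "converting LDA topics to suitable TF-IDF matrices" (my own assumption, not the paper's confirmed method) is to keep TF-IDF at the document level and pair it with the per-document topic proportions from the trained model. A sketch continuing from the code above:

from gensim.models import TfidfModel

# TF-IDF weights over the same bag-of-words corpus...
tfidf = TfidfModel(corpus)
tfidf_corpus = [tfidf[bow] for bow in corpus]

# ...and, separately, the topic proportions LDA assigns to each document.
doc_topics = [ldamodel.get_document_topics(bow, minimum_probability=0.0)
              for bow in corpus]

for topics, weights in zip(doc_topics, tfidf_corpus):
    print(topics, weights[:3])   # topic shares plus a few (term_id, tf-idf) pairs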

Retraining Spacy's Noun Chunks

I have the following sentences:
sent_1 = 'The cat caught the mouse.'
sent_2 = 'The cat caught and killed the mouse.'
Now I want to know who did what to whom. spaCy's noun_chunks work perfectly in the first case, indicating "The cat" as the "nsubj" with chunk.root.head.text being "caught". Likewise, "the mouse" is correctly classified as the "dobj", again with "caught" as chunk.root.head.text. So it is easy to match these two.
However, in the second case the nsubj gets "caught" as its chunk.root.head.text while the dobj gets "killed", whereas they actually belong together. Is there a way to account for these kinds of cases?
In the second case, 'killed' is the head of 'the mouse', as it is the word connecting the noun chunk to the rest of the phrase. From the spaCy documentation:
Root text: The original text of the word connecting the noun chunk to the rest of the parse.
https://spacy.io/usage/linguistic-features#noun-chunks
N.b. that link has a very similar example to yours - a sentence with multiple noun chunks with different roots. ('Autonomous cars shift insurance liability toward manufacturers')
To answer your question, if you want 'caught' to be found as the head in both instances, then really what you're asking for is to recursively check the head of the tree for each noun_chunk... something like this:
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp('The cat caught and killed the mouse.')
[x.root.head.head for x in doc.noun_chunks]
which yields:
[caught, caught]
N.b., this works for your example, but if you need to handle arbitrary sentences you'd need to do something a bit more sophisticated, i.e. actually recurse up the tree, e.g.:
def get_head(x):
    # climb the dependency tree until a token is its own head (the sentence root)
    return x.head if x.head.head == x.head else get_head(x.head)
resulting in:
doc2 = nlp("Autonomous cars shift insurance liability toward manufacturers away from everyday users") # adapted from the spacy example with an additional NC 'everyday users' added
In [17]: [get_head(x.root.head) for x in doc.noun_chunks]
Out[17]: [caught, caught]
In [18]: [get_head(x.root.head) for x in doc2.noun_chunks]
Out[18]: [shift, shift, shift, shift]
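For reference, a small self-contained check on the two sentences from the question, reusing the same get_head helper (the expected output assumes en_core_web_sm parses the sentences as described in the answer):

import spacy

nlp = spacy.load('en_core_web_sm')

def get_head(x):
    # climb the dependency tree until a token is its own head (the sentence root)
    return x.head if x.head.head == x.head else get_head(x.head)

for sent in ['The cat caught the mouse.', 'The cat caught and killed the mouse.']:
    doc = nlp(sent)
    print([(chunk.text, get_head(chunk.root.head).text) for chunk in doc.noun_chunks])
# expected: [('The cat', 'caught'), ('the mouse', 'caught')] for both sentences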

Check how many words from a given list occur in list of text/strings

I have a list of text data which contains reviews, something likes this:
1. 'I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than most.'
2. 'Product arrived labeled as Jumbo Salted Peanuts...the peanuts were actually small sized unsalted. Not sure if this was an error or if the vendor intended to represent the product as "Jumbo".',
3. 'This is a confection that has been around a few centuries. It is a light, pillowy citrus gelatin with nuts - in this case Filberts. And it is cut into tiny squares and then liberally coated with powdered sugar. And it is a tiny mouthful of heaven. Not too chewy, and very flavorful. I highly recommend this yummy treat. If you are familiar with the story of C.S. Lewis\' "The Lion, The Witch, and The Wardrobe" - this is the treat that seduces Edmund into selling out his Brother and Sisters to the Witch.'
I have a separate list of words, and I want to know which of them exist in these reviews:
['food','science','good','buy','feedback'....]
I want to know which of these words are present in each review and select reviews that contain a certain number of them. For example, let's say I only select reviews that contain at least 3 of the words from this list; it should display all those reviews, but also show which of the words were encountered in each review while selecting it.
I have the code for selecting reviews containing at least 3 of the words, but how do I get the second part, which tells me which words exactly were encountered? Here is my initial code:
keywords = list(words)
text = list(df.summary.values)
sentences = []
for element in text:
    if len(set(keywords) & set(element.split(' '))) >= 3:
        sentences.append(element)
To answer the second part, allow me to revisit how to approach the first part. A handy approach here is to cast your review strings into sets of word strings.
Like this:
review_1 = "I have bought several of the Vitality canned dog food products and"
review_1 = set(review_1.split(" "))
Now the review_1 set contains one of every word. Then take your list of words, convert it to a set, and do an intersection.
words = ['food','science','good','buy','feedback'....]
words = set(['food','science','good','buy','feedback'....])
matches = review_1.intersection(words)
The resulting set, matches, contains all the words that are common. The length of this is the number of matches.
Now, this does not work if you cared about how many of each word matches. For example, if the word "food" is found twice in the review and "science" is found once, does that count as matching three words?
If so, let me know via comment and I can write some code to update the answer to include that scenario.
EDIT: Updating to include comment question
If you want to keep a count of how many times each word repeats, then hang onto the review list. Only cast it to set when performing the intersection. Then, use the 'count' list method to count the number of times each match appears in the review. In the example below, I use a dictionary to store the results.
review_1 = "I have bought several of the Vitality canned dog food products and"
words = ['food','science','good','buy','feedback'....]
words = set(['food','science','good','buy','feedback'....])
matches = set(review_1).intersection(words)
match_counts = dict()
for match in matches:
match_counts[match] = words.count(match)
You can use set intersection for finding the common words:
def filter_reviews(data, *, trigger_words=frozenset({'food', 'science', 'good', 'buy', 'feedback'})):
    for review in data:
        words = review.split()  # use whatever method is appropriate to get the words
        common = trigger_words.intersection(words)
        if len(common) >= 3:
            yield review, common
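For illustration, a usage sketch of the generator above, reusing the df.summary column from the asker's code (assumed to hold the review strings):

reviews = list(df.summary.values)   # review strings from the question's DataFrame
for review, common in filter_reviews(reviews):
    print(sorted(common), '->', review)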

Matching set of words with set of sentences in python nlp

I have a use case where I want to match a list of words against a list of sentences and bring back the most relevant sentences.
I am working in Python. What I have already tried is using KMeans, where we cluster our set of documents and then predict which cluster a sentence falls into. But in my case I already have the list of words available.
def getMostRelevantSentences():
    Sentences = ["This is the most beautiful place in the world.",
                 "This man has more skills to show in cricket than any other game.",
                 "Hi there! how was your ladakh trip last month?",
                 "Isn’t cricket supposed to be a team sport? I feel people should decide first whether cricket is a team game or an individual sport."]
    words = ["cricket","sports","team","play","match"]
    # TODO: this should return the 2nd and last item from Sentences, as the words list mostly matches them
So from the above code I want to return the sentences which most closely match the words provided. I don't want to use supervised machine learning here. Any help will be appreciated.
So finally I have used this great library called gensim to generate the similarity scores.
import gensim
from nltk.tokenize import word_tokenize

def getSimilarityScore(raw_documents, words):
    gen_docs = [[w.lower() for w in word_tokenize(text)]
                for text in raw_documents]
    dictionary = gensim.corpora.Dictionary(gen_docs)
    corpus = [dictionary.doc2bow(gen_doc) for gen_doc in gen_docs]
    tf_idf = gensim.models.TfidfModel(corpus)
    sims = gensim.similarities.Similarity('/usr/workdir', tf_idf[corpus],
                                          num_features=len(dictionary))
    query_doc_bow = dictionary.doc2bow(words)
    query_doc_tf_idf = tf_idf[query_doc_bow]
    return sims[query_doc_tf_idf]
You can use this method as:
Sentences = ["This is the most beautiful place in the world.",
"This man has more skills to show in cricket than any other game.",
"Hi there! how was your ladakh trip last month?",
"Isn’t cricket supposed to be a team sport? I feel people should decide first whether cricket is a team game or an individual sport."]
words = ["cricket","sports","team","play","match"]
words_lower = [w.lower() for w in words]
getSimilarityScore(Sentences,words_lower)
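getSimilarityScore returns one similarity value per sentence. To actually pick out the most relevant sentences, as the question asks, one possible follow-up (assuming NumPy is available) is to rank the sentences by score:

import numpy as np

scores = getSimilarityScore(Sentences, words_lower)
ranked = np.argsort(scores)[::-1]                       # sentence indices, most similar first
top_sentences = [Sentences[i] for i in ranked if scores[i] > 0]
print(top_sentences)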

Calculate tf-idf in Gensim for my vocabulary

I have a set of words (n-grams) for which I need to calculate tf-idf values. These words are:
myvocabulary = ['tim tam', 'jam', 'fresh milk', 'chocolates', 'biscuit pudding']
My corpus looks as follows.
corpus = {1: "making chocolates biscuit pudding easy first get your favourite biscuit chocolates", 2: "tim tam drink new recipe that yummy and tasty more thicker than typical milkshake that uses normal chocolates", 3: "making chocolates drink different way using fresh milk egg"}
I am currently getting tf-idf values for my n-grams in myvocabulary using sklearn as follows.
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(vocabulary=myvocabulary, ngram_range=(1, 3))
tfs = tfidf.fit_transform(corpus.values())
However, I am interested in doing the same in Gensim. All the examples I came across in Gensim:
use only unigrams (I want it for bigrams and trigrams as well)
calculate tf-idf for all the words (I only want to calculate it for the words in myvocabulary)
Hence, please help me to find out how to do the above two things in Gensim.
In gensim, for a dictionary, you should use the gensim.corpora.Dictionary class; look at the examples.
Unfortunately, we have no support for n-grams in general, only bigrams over words with the Phrases class.
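For completeness, a workaround sketch (my own assumption about how to combine the pieces, not an official gensim recipe): generate the unigram-to-trigram terms of each document manually, keep only the terms from myvocabulary in the Dictionary, and then run TfidfModel over the resulting bag-of-words corpus:

from gensim.corpora import Dictionary
from gensim.models import TfidfModel

myvocabulary = ['tim tam', 'jam', 'fresh milk', 'chocolates', 'biscuit pudding']
corpus = {1: "making chocolates biscuit pudding easy first get your favourite biscuit chocolates",
          2: "tim tam drink new recipe that yummy and tasty more thicker than typical milkshake that uses normal chocolates",
          3: "making chocolates drink different way using fresh milk egg"}

def ngrams(tokens, n_max=3):
    # emit every unigram, bigram and trigram as a space-joined string
    return [' '.join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]

docs = [ngrams(text.split()) for text in corpus.values()]
dictionary = Dictionary(docs)
# keep only the ids of terms that appear in myvocabulary
good_ids = [dictionary.token2id[t] for t in myvocabulary if t in dictionary.token2id]
dictionary.filter_tokens(good_ids=good_ids)
bow_corpus = [dictionary.doc2bow(doc) for doc in docs]
tfidf = TfidfModel(bow_corpus)
tfidf_vectors = [tfidf[bow] for bow in bow_corpus]   # (term_id, weight) pairs per document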
