RegEx in vocabulary not working in sklearn TfidfVectorizer - python

I'm trying to calculate tf-idf for selected words in a corpus, but it doesn't work when I use a regex in the selected words.
Below is an example I copied from another question on Stack Overflow, with small changes to reflect my problem.
The code is pasted below. It works if I write "chocolate" and "chocolates" separately, but doesn't work if I write 'chocolate|chocolates'.
Can someone help me understand why and suggest possible solutions to this problem?
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

keywords = ['tim tam', 'jam', 'fresh milk', 'chocolate|chocolates', 'biscuit pudding']
corpus = {1: "making chocolate biscuit pudding easy first get your favourite biscuit chocolates", 2: "tim tam drink new recipe that yummy and tasty more thicker than typical milkshake that uses normal chocolates", 3: "making chocolates drink different way using fresh milk egg"}

tfidf = TfidfVectorizer(vocabulary=keywords, stop_words='english', ngram_range=(1, 3))
tfs = tfidf.fit_transform(corpus.values())

feature_names = tfidf.get_feature_names()
corpus_index = [n for n in corpus]
rows, cols = tfs.nonzero()
for row, col in zip(rows, cols):
    print((feature_names[col], corpus_index[row]), tfs[row, col])

tfidf_results = pd.DataFrame(tfs.T.todense(), index=feature_names, columns=corpus_index).T
I expect the results to be:
('biscuit pudding', 1) 0.652490884512534
('chocolates', 1) 0.3853716274664007
('chocolate', 1) 0.652490884512534
('chocolates', 2) 0.5085423203783267
('tim tam', 2) 0.8610369959439764
('chocolates', 3) 0.5085423203783267
('fresh milk', 3) 0.8610369959439764
But it currently returns:
('biscuit pudding', 1) 1.0
('tim tam', 2) 1.0
('fresh milk', 3) 1.0

I'm going to guess you are using TfidfVectorizer from scikit-learn. If you read the documentation carefully, nowhere does it say that you can use regexes in your vocabulary. Can you point to the question you say you copied from?
If you want to group multiple terms together manually, you can specify a mapping instead of an iterable as your vocabulary. For example:
keywords = {'tim tam': 0, 'jam': 1, 'fresh milk': 2, 'chocolate': 3, 'chocolates': 3, 'biscuit pudding': 4}
Notice how both chocolate and chocolates map to the same index.
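Some scikit-learn versions reject repeated indices in a vocabulary mapping, so here is a hedged alternative sketch (not part of the answer above): keep both surface forms as separate vocabulary entries and combine their columns afterwards. Note that summing the two columns is only a rough grouping, not a recomputed tf-idf for the merged term.

keywords = ['tim tam', 'jam', 'fresh milk', 'chocolate', 'chocolates', 'biscuit pudding']
tfidf = TfidfVectorizer(vocabulary=keywords, stop_words='english', ngram_range=(1, 3))
tfs = tfidf.fit_transform(corpus.values())

# Columns follow the order of the keywords list; merge the two chocolate columns
results = pd.DataFrame(tfs.todense(), index=list(corpus), columns=keywords)
results['chocolate(s)'] = results['chocolate'] + results['chocolates']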

Related

How to extract all possible noun phrases from text

I want to extract some desirable concepts (noun phrases) from text automatically. My plan is to extract all noun phrases and then label them into two classes (i.e., desirable phrases and non-desirable phrases). After that, train a classifier to classify them. What I am trying now is to extract all possible phrases as the training set first. For example, one sentence is Where a shoulder of richer mix is required at these junctions, or at junctions of columns and beams, the items are so described. I want to get all phrases like shoulder, richer mix, shoulder of richer mix, junctions, junctions of columns and beams, columns and beams, columns, beams or whatever is possible. The desirable phrases are shoulder, junctions, junctions of columns and beams. But I don't care about correctness at this step; I just want to get the training set first. Are there available tools for such a task?
I tried Rake in rake_nltk, but the results failed to include my desirable phrases (i.e., it did not extract all possible phrases)
from rake_nltk import Rake

data = 'Where a shoulder of richer mix is required at these junctions, or at junctions of columns and beams, the items are so described.'
r = Rake()
r.extract_keywords_from_text(data)
phrase = r.get_ranked_phrases()
print(phrase)
Result: ['richer mix', 'shoulder', 'required', 'junctions', 'items', 'described', 'columns', 'beams']
(Missed junctions of columns and beams here)
I also tried phrasemachine, the results also missed some desirable ones.
import spacy
import phrasemachine

nlp = spacy.load('en_core_web_sm')
doc = nlp(data)
tokens = [token.text for token in doc]
pos = [token.pos_ for token in doc]

out = phrasemachine.get_phrases(tokens=tokens, postags=pos, output="token_spans")
print(out['token_spans'])
while len(out['token_spans']):
    start, end = out['token_spans'].pop()
    print(tokens[start:end])
Result:
[(2, 6), (4, 6), (14, 17)]
['junctions', 'of', 'columns']
['richer', 'mix']
['shoulder', 'of', 'richer', 'mix']
(Missed many noun phrases here)
You may wish to make use of the noun_chunks attribute:
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp('Where a shoulder of richer mix is required at these junctions, or at junctions of columns and beams, the items are so described.')

phrases = set()
for nc in doc.noun_chunks:
    phrases.add(nc.text)
    phrases.add(doc[nc.root.left_edge.i:nc.root.right_edge.i + 1].text)
print(phrases)
{'junctions of columns and beams', 'junctions', 'the items', 'a shoulder', 'columns', 'richer mix', 'beams', 'columns and beams', 'a shoulder of richer mix', 'these junctions'}
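If the leading articles are unwanted (the question asks for shoulder rather than a shoulder), a small follow-up sketch, assuming the same doc as above, is to drop determiners from each chunk:

# Drop determiners such as 'a' / 'the' / 'these' from every noun chunk
cleaned = {' '.join(tok.text for tok in nc if tok.pos_ != 'DET') for nc in doc.noun_chunks}
print(cleaned)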

Counting frequency of keywords with sklearn only yielding zero counts

I am trying to run a Python script that counts the frequency of certain pre-defined keywords in a text. However, I only get zeros when running the script posted below (i.e. the script does not count any occurrence of a keyword in the targeted text).
The problem seems to be in the line "X = vectorizer.fit_transform(text)", since it always returns an empty (all-zero) X.
What I am trying to get as a result in this short example is a table that lists the counts of each flavour of icecream in a separate column, followed by the sum of individual counts.
import pandas as pd
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer
icecream = ['Vanilla', 'Strawberry', 'Chocolate', 'Peach']
vectorizer = CountVectorizer(vocabulary=icecream, encoding='utf8', lowercase=True, analyzer='word', decode_error='ignore', ngram_range=(1, 1))
dq = pd.DataFrame(columns=icecream)
vendor = 'Franks Store'
text = ['We offer Vanilla with Hazelnut, Vanilla with Coconut, Chocolate and Strawberry']
X = vectorizer.fit_transform(text)
vocab = vectorizer.get_feature_names()
counts = X.sum(axis=0).A1
freq_distribution = Counter(dict(zip(vocab, counts)))
allwords = dict(freq_distribution)
totalnum = sum(allwords.values())
allwords.update({'totalnum': totalnum})
dy = pd.DataFrame.from_dict(allwords, orient='index')
dy.columns = [vendor]
dy = dy.transpose()
dq = dy.append(dq, sort=False)
print(dq)
If you have an idea on what might be wrong with this code, I would be very happy if you share it with me. Thank you!
Since you are using lowercase=True in your parameters, all the extracted tokens will be lowercase. But your vocabulary is this:
icecream = ['Vanilla', 'Strawberry', 'Chocolate', 'Peach']
These terms will never match their lowercase counterparts, so every count is 0. You should lowercase them too:
icecream = ['vanilla', 'strawberry', 'chocolate', 'peach']
The output after that is:
              vanilla  strawberry  chocolate  peach  totalnum
Franks Store        2           1          1      0       4.0
Now see that vanilla has a count of 2, because it appears twice in the text. If you only want the presence or absence of a specific flavour, you can use the binary=True param in CountVectorizer.
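A minimal sketch of that presence/absence variant, using the lowercased icecream list and the same text as above:

# binary=True records only whether each flavour occurs, not how often
binary_vectorizer = CountVectorizer(vocabulary=icecream, lowercase=True, binary=True)
Xb = binary_vectorizer.fit_transform(text)
print(Xb.toarray())  # [[1 1 1 0]] -> vanilla, strawberry, chocolate present; peach absent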

How to extract all the ngrams from a text dataframe column in different order in a pandas dataframe?

Below is the input Dataframe I have.
id description
1 **must watch avoid** **good acting**
2 average movie bad acting
3 good movie **acting good**
4 pathetic avoid
5 **avoid watch must**
I want to extract the ngrams, i.e. bigrams, trigrams and 4-word grams, from the frequently used words in the phrases. Let's tokenize the phrases into words; can we then find ngrams even when the frequently used words appear in a different order? For example, if the 1st phrase contains the frequently used words "good movie" and the 2nd phrase contains "movie good", can we still extract the bigram "good movie"? A sample of what I'm expecting is shown below:
ngram frequency
must watch 2
acting good 2
must watch avoid 2
average 1
As we can see, in the 1st sentence the frequently used words are "must watch" and in the last sentence we have "watch must", i.e. the order of the frequent words is changed. So it extracts the bigram "must watch" with a frequency of 2.
I need to extract ngrams/bigrams from frequently used words from the phrases.
How to implement this using Python dataframe?
Any help is greatly appreciated.
Thanks!
import pandas as pd
from collections import Counter
from itertools import chain

data = [
    {"sentence": "Run with dogs, or shoes, or dogs and shoes"},
    {"sentence": "Run without dogs, or without shoes, or without dogs or shoes"},
    {"sentence": "Hold this while I finish writing the python script"},
    {"sentence": "Is this python script written yet, hey, hold this"},
    {"sentence": "Can dogs write python, or a python script?"},
]

def find_ngrams(input_list, n):
    return list(zip(*[input_list[i:] for i in range(n)]))

df = pd.DataFrame.from_records(data)
df['bigrams'] = df['sentence'].map(lambda x: find_ngrams(x.split(" "), 2))
df.head()
Now on to the frequency counts:
# Bigram Frequency Counts
bigrams = df['bigrams'].tolist()
bigrams = list(chain(*bigrams))
bigrams = [(x.lower(), y.lower()) for x,y in bigrams]
bigram_counts = Counter(bigrams)
bigram_counts.most_common(10)
[(('dogs,', 'or'), 2),
(('shoes,', 'or'), 2),
(('or', 'without'), 2),
(('hold', 'this'), 2),
(('python', 'script'), 2),
(('run', 'with'), 1),
(('with', 'dogs,'), 1),
(('or', 'shoes,'), 1),
(('or', 'dogs'), 1),
(('dogs', 'and'), 1)]
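The original question also asks that reordered pairs ("must watch" vs "watch must") be counted as the same bigram. A rough sketch of one way to do that, assuming the lowercased bigrams list built above, is to sort each pair before counting so word order no longer matters:

# Count bigrams irrespective of word order by sorting each pair first
unordered = [tuple(sorted(pair)) for pair in bigrams]
unordered_counts = Counter(unordered)
print(unordered_counts.most_common(10))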

How to use comments in csv table to get a text classification?

I have several statements about whether people like a product or not. I have anonymized the comments and used other names.
As the classification problem is very similar to that of movie reviews, I relied entirely on the tutorial at http://www.nltk.org/book/ch06.html, Section 1.3.
import random
import nltk
from nltk.corpus import movie_reviews

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)

all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = list(all_words)[:2000]

def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features

featuresets = [(document_features(d), c) for (d, c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)
So far so good.
My concern is that my data are organized in following way.
sex;statement;income_dollar_year
"m";"I REALLY like milk. It taste soo good and is healthy. Everyone should drink it";40000
"f";"Milk itself is tasty, but I don't like how cows are kept in huge stables";30000
"m";"I don't like milk, I have intolerance against lactose, so my stomach pains";35000
I just want labels saying that statements one and two are positive and the third one is negative. I also know that I have to remove all words that are not in the dataset, otherwise I would have zeros and that would get me into trouble.
As I don't know exactly how the data in movie_reviews is organized, I can't get to a solution.
My idea is to get a CSV like this:
sex;statement;income_dollar_year;class
"m";"I REALLY like milk. It taste soo good and is healthy. Everyone should drink it";40000;"pos"
"f";"Milk itself is tasty, but I don't like how cows are kept in huge stables";30000;"pos"
"m";"I don't like milk, I have intolerance against lactose, so my stomach pains";35000;"neg"

Calculate tf-idf in Gensim for my vocabulary

I have a set of words (n-grams) for which I need to calculate tf-idf values. These words are:
myvocabulary = ['tim tam', 'jam', 'fresh milk', 'chocolates', 'biscuit pudding']
My corpus looks as follows.
corpus = {1: "making chocolates biscuit pudding easy first get your favourite biscuit chocolates", 2: "tim tam drink new recipe that yummy and tasty more thicker than typical milkshake that uses normal chocolates", 3: "making chocolates drink different way using fresh milk egg"}
I am currently getting tf-idf values for my n-grams in myvocabulary using sklearn as follows.
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(vocabulary=myvocabulary, ngram_range=(1, 3))
tfs = tfidf.fit_transform(corpus.values())
However, I am interested in doing the same in Gensim. All the examples I came across in Gensim:
use only unigrams (I want it for bigrams and trigrams as well)
calculate values for all the words (I only want to calculate them for the words in myvocabulary)
Hence, please help me to find out how to do the above two things in Gensim.
In Gensim, for a dictionary you should use the gensim.corpora.Dictionary class; look at the examples.
Unfortunately, Gensim has no general support for ngrams, only word bigrams via the Phrases class.
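A rough sketch of one way to approximate the sklearn setup in Gensim, under the assumption that each vocabulary phrase is treated as a single token (the tokens_from helper below is a hypothetical illustration, not a Gensim API):

from gensim.corpora import Dictionary
from gensim.models import TfidfModel

def tokens_from(text, vocab):
    # Build unigram/bigram/trigram candidates and keep only those in the vocabulary
    words = text.split()
    grams = [' '.join(words[i:i + n]) for n in (1, 2, 3) for i in range(len(words) - n + 1)]
    return [g for g in grams if g in vocab]

texts = [tokens_from(doc, set(myvocabulary)) for doc in corpus.values()]
dictionary = Dictionary(texts)                       # contains only the vocabulary terms
bow_corpus = [dictionary.doc2bow(t) for t in texts]
model = TfidfModel(bow_corpus)

for doc_id, bow in zip(corpus, bow_corpus):
    print(doc_id, [(dictionary[i], round(weight, 3)) for i, weight in model[bow]])

Note that Gensim's default tf-idf weighting and normalisation differ from sklearn's, so the numbers will not match the TfidfVectorizer output exactly.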
