Counting frequency of keywords with sklearn only yielding zero counts - python

I am trying to run Python code that counts the frequency of certain predefined keywords in a text. However, I only get zeros when running the script posted below (i.e. the script does not count any occurrence of a keyword in the target text).
The problem seems to lie in the line "X = vectorizer.fit_transform(text)", since it always returns an empty variable X.
What I am trying to get in this short example is a table that lists the count of each flavour of ice cream in a separate column, followed by the sum of the individual counts.
import pandas as pd
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer
icecream = ['Vanilla', 'Strawberry', 'Chocolate', 'Peach']
vectorizer = CountVectorizer(vocabulary=icecream, encoding='utf8', lowercase=True, analyzer='word', decode_error='ignore', ngram_range=(1, 1))
dq = pd.DataFrame(columns=icecream)
vendor = 'Franks Store'
text = ['We offer Vanilla with Hazelnut, Vanilla with Coconut, Chocolate and Strawberry']
X = vectorizer.fit_transform(text)
vocab = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
counts = X.sum(axis=0).A1
freq_distribution = Counter(dict(zip(vocab, counts)))
allwords = dict(freq_distribution)
totalnum = sum(allwords.values())
allwords.update({'totalnum': totalnum})
dy = pd.DataFrame.from_dict(allwords, orient='index')
dy.columns = [vendor]
dy = dy.transpose()
dq = pd.concat([dy, dq], sort=False)  # DataFrame.append was removed in pandas 2.0
print(dq)
If you have an idea on what might be wrong with this code, I would be very happy if you share it with me. Thank you!

Because you pass lowercase=True, every token found in the text is converted to lowercase before it is matched against the vocabulary. But your vocabulary is this:
icecream = ['Vanilla', 'Strawberry', 'Chocolate', 'Peach']
The capitalized terms here never match the lowercased tokens, so every count is 0. You should lowercase the vocabulary too:
icecream = ['vanilla', 'strawberry', 'chocolate', 'peach']
The output after that is:
              vanilla  strawberry  chocolate  peach  totalnum
Franks Store        2           1          1      0       4.0
Note that vanilla has a count of 2, because it appears twice in the text. If you only want the presence or absence of a specific flavor, you can use the binary=True parameter of CountVectorizer.
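As a minimal sketch of the fix described above, the snippet below builds one vectorizer for raw counts and one with binary=True. The variable names count_vec, binary_vec, counts and presence are illustrative and not from the original post; the columns are read in the order of the vocabulary list, which is how a fixed list vocabulary is laid out.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Vocabulary written in lowercase so it matches the lowercased tokens.
icecream = ['vanilla', 'strawberry', 'chocolate', 'peach']
text = ['We offer Vanilla with Hazelnut, Vanilla with Coconut, Chocolate and Strawberry']

# Raw counts: one column per vocabulary entry, in the order of the list.
count_vec = CountVectorizer(vocabulary=icecream, lowercase=True)
counts = count_vec.fit_transform(text).toarray()[0].tolist()

# binary=True caps every nonzero count at 1 (presence or absence only).
binary_vec = CountVectorizer(vocabulary=icecream, lowercase=True, binary=True)
presence = binary_vec.fit_transform(text).toarray()[0].tolist()

print(dict(zip(icecream, counts)))
print(dict(zip(icecream, presence)))
```

If the capitalized vocabulary has to stay as it is, passing lowercase=False instead should also make the terms match, at the cost of treating 'Vanilla' and 'vanilla' as different tokens.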

Related

Python: How to match the words in split and non split?

I have a DataFrame as below, and I wish to detect the words that are repeated, whether they are written split or unsplit:
Table A:
Cat     Comments
Stat A  power down due to electric shock
Stat A  powerdown because short circuit
Stat A  top 10 on re work
Stat A  top10 on rework
I wish to get the output as below:
Repeated words= ['Powerdown', 'top10','on','rework']
Anyone have ideas?
I assume that having the words in a dataframe column is not really relevant for the problem at hand. I will therefore transfer them into a list, and then search for repeats.
import pandas as pd
df = pd.DataFrame({"Comments": ["power down due to electric shock", "powerdown because short circuit", "top 10 on re work", "top10 on rework"]})
words = df['Comments'].to_list()
This leads to
['power down due to electric shock',
'powerdown because short circuit',
'top 10 on re work',
'top10 on rework']
Now we create a new list in which every pair of adjacent words is also added in joined form, so that "top 10" and "top10" are treated as equal:
newa = []
for s in words:
    a = s.split()
    for i in range(len(a)-1):
        w = a[i]+a[i+1]
        a.append(w)
    newa.append(a)
which yields:
[['power',
'down',
'due',
'to',
'electric',
'shock',
'powerdown',
'downdue',
'dueto',
'toelectric',
'electricshock'],...
Finally we flatten the list and use Counter to find words which occur more than once:
from collections import Counter
from itertools import chain
wordList = list(chain(*newa))
wordCount = Counter(wordList)
[w for w,c in wordCount.most_common() if c>1]
leading to
['powerdown', 'on', 'top10', 'rework']
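One thing to be aware of: the Counter above also counts a word that occurs twice inside a single comment. If repeats should only count across different comments (an assumption, the question does not say), a variation is to deduplicate each comment's tokens with a set before counting. The function name repeated_tokens is hypothetical.

```python
from collections import Counter

def repeated_tokens(comments):
    """Return tokens (single words or joined adjacent pairs) seen in more than one comment."""
    counts = Counter()
    for comment in comments:
        tokens = comment.split()
        # Join each pair of adjacent words, e.g. 'top' + '10' -> 'top10'.
        joined_pairs = [a + b for a, b in zip(tokens, tokens[1:])]
        # A set ensures each token is counted at most once per comment.
        counts.update(set(tokens + joined_pairs))
    return [token for token, n in counts.items() if n > 1]

comments = [
    "power down due to electric shock",
    "powerdown because short circuit",
    "top 10 on re work",
    "top10 on rework",
]
repeated = repeated_tokens(comments)
print(sorted(repeated))
```

The order of the returned list depends on set iteration order, which is why the example sorts it before printing.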
Let's try:
words = df['Comments'].str.split(' ').explode()
biwords = words + words.groupby(level=0).shift(-1)
(pd.concat([words.groupby(level=0).apply(pd.Series.drop_duplicates),     # remove duplicate words within a comment
            biwords.groupby(level=0).apply(pd.Series.drop_duplicates)])  # remove duplicate bi-words within a comment
   .dropna()                                  # remove NaN created by shifting
   .to_frame().join(df[['Cat']])              # join with original Cat
   .loc[lambda x: x.duplicated(keep=False)]   # select the duplicated `Comments` within `Cat`
   .groupby('Cat')['Comments'].unique()       # select the unique values within each `Cat`
)
Output:
Cat
Stat A [powerdown, on, top10, rework]
Name: Comments, dtype: object

Text Preprocessing for NLP but from List of Dictionaries

I'm attempting to do an NLP project with a Goodreads data set. My data set is a list of dictionaries. Each dictionary looks like this (the list is called 'reviews'):
>>> reviews[0]
{'user_id': '8842281e1d1347389f2ab93d60773d4d',
'book_id': '23310161',
'review_id': 'f4b4b050f4be00e9283c92a814af2670',
'rating': 4,
'review_text': 'Fun sequel to the original.',
'date_added': 'Tue Nov 17 11:37:35 -0800 2015',
'date_updated': 'Tue Nov 17 11:38:05 -0800 2015',
'read_at': '',
'started_at': '',
'n_votes': 7,
'n_comments': 0}
There are 700k+ of these dictionaries in my dataset.
First question: I am only interested in the elements 'rating' and 'review_text'. I know I can delete elements from each dictionary, but how do I do it for all of the dictionaries?
Second question: I am able to do sentence and word tokenization of an individual dictionary in the list by specifying the dictionary in the list, then the element 'review_text' within the dictionary like so:
paragraph = reviews[0]['review_text']
And then applying sent_tokenize and word_tokenize like so:
print(sent_tokenize(paragraph))
print(word_tokenize(paragraph))
But how do I apply these methods to the entire data set? I am stuck here, and cannot even attempt to do any of the text preprocessing (lower casing, removing punctuation, lemmatizing, etc).
TIA
To answer the first question, you can simply load the records into a DataFrame that contains only the columns you are interested in (i.e. rating and review_text). This avoids looping over the records and managing them one by one, and the result is easy to manipulate in later steps.
Once you have the DataFrame, use apply to preprocess the text column (e.g. lowercase, tokenize, remove punctuation, lemmatize, and stem) and store the result in a new column named tokens. This answers the second question.
import string
import pandas as pd
from nltk import sent_tokenize, word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer
punc_list = list(string.punctuation)
porter = PorterStemmer()
lemmatizer = WordNetLemmatizer()
def text_processing(row):
    all_words = list()
    # sentence tokenize
    for sent in sent_tokenize(row['review_text']):
        # lower words and tokenize
        words = word_tokenize(sent.lower())
        # lemmatize
        words_lem = [lemmatizer.lemmatize(w) for w in words]
        # remove punctuation
        used_words = [w for w in words_lem if w not in punc_list]
        # stem
        words_stem = [porter.stem(w) for w in used_words]
        all_words += words_stem
    return all_words
# create dataframe from list of dicts (select only interesting columns)
df = pd.DataFrame(reviews, columns=['user_id', 'rating', 'review_text'])
# apply the preprocessing row by row and keep the result in a new column
df['tokens'] = df.apply(text_processing, axis=1)
print(df.head())
example of output:
   user_id  rating               review_text                        tokens
0        1       4         Fun sequel to the        [fun, sequel, to, the]
1        2       2   It was a slippery slope  [it, wa, a, slipperi, slope]
2        3       3      The trick to getting         [the, trick, to, get]
3        4       3            The bird had a           [the, bird, had, a]
4        5       5       That dog likes cats        [that, dog, like, cat]
Finally, if you would rather not keep the data in a DataFrame, you can export it to other formats such as CSV (to_csv), JSON (to_json), or a list of dicts (to_dict('records')).
Hope this helps.
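For readers who prefer to stay with plain Python, here is a hedged sketch of the same two steps without pandas: a comprehension keeps only the wanted keys in every dictionary, and a small helper applies any text function to every review. The sample records, the helper name apply_to_reviews, and the use of str.split as a stand-in for word_tokenize are all assumptions made for illustration.

```python
# Made-up sample records shaped like the ones in the question.
reviews = [
    {'user_id': 'u1', 'book_id': 'b1', 'rating': 4,
     'review_text': 'Fun sequel to the original.', 'n_votes': 7},
    {'user_id': 'u2', 'book_id': 'b2', 'rating': 2,
     'review_text': 'Too slow. Not for me.', 'n_votes': 0},
]

# First question: keep only the keys of interest in every dictionary.
wanted = ('rating', 'review_text')
trimmed = [{key: review[key] for key in wanted} for review in reviews]

# Second question: apply a text function to every review, not just reviews[0].
def apply_to_reviews(records, func):
    """Return func(review_text) for every record, in order."""
    return [func(record['review_text']) for record in records]

# str.split is only a stand-in here; word_tokenize could be passed instead.
lowercased_words = apply_to_reviews(trimmed, lambda text: text.lower().split())
print(trimmed[0])
print(lowercased_words[1])
```

With 700k+ records this builds full lists in memory; whether that is acceptable depends on the machine, so a generator may be worth considering.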

Word count frequency: removing stopwords

I have the following list of word frequencies, generated by the code below.
           Frequency
the                3
15                 5
18                 1
a                  1
2020               4
...              ...
house              1
apartment          1
hotel              5
pool               1
swimming           1
The code is
from sklearn.feature_extraction.text import CountVectorizer
word_vectorizer = CountVectorizer(ngram_range=(1,1), analyzer='word')
sparse_matrix = word_vectorizer.fit_transform(df['Sentences'])
w_freq = sum(sparse_matrix).toarray()[0]
w_df = pd.DataFrame(w_freq, index=word_vectorizer.get_feature_names_out(), columns=['Frequency'])  # get_feature_names() was removed in scikit-learn 1.2
w_df
I would like to remove the stopwords from the list of words above (not in the column of my dataframe, but just in the output, creating a new variable in case it is needed).
I have tried with w_df =[w for w in w_df if not w in stop_words] but it gave me ['Frequency'] as output.
I think this happens because it is not a list.
Could you please tell me how to remove stopwords (numbers included) from there?
Thanks
CountVectorizer has a parameter that does that for you. You can feed it a custom list of stopwords, or set it to english, a built-in stop word list. Here's an example:
s = pd.Series('Just a random sentence with more than one stopword')
word_vectorizer = CountVectorizer(ngram_range=(1,1),
                                  analyzer='word',
                                  stop_words='english')
sparse_matrix = word_vectorizer.fit_transform(s)
w_freq = sum(sparse_matrix).toarray()[0]
# get_feature_names() was removed in scikit-learn 1.2
w_df = pd.DataFrame(w_freq,
                    index=word_vectorizer.get_feature_names_out(),
                    columns=['Frequency'])
print(w_df)
Frequency
just 1
random 1
sentence 1
stopword 1
Just to add, your approach wasn't all that wrong. You needed just a minor change.
w_df = [w for w in w_df.index if not w in stop_words]
Your problem was simply that, in the list comprehension, you iterated over the dataframe itself rather than the tokens which are in its index. This would also return the desired result.
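The list comprehension above returns only the words, and it does not handle the numbers the question also wants removed. As a sketch, the same filter can be used to select rows, which keeps the Frequency column, and str.isdigit can drop purely numeric tokens. The small w_df and the stop_words set below are made-up stand-ins for the real data.

```python
import pandas as pd

# Made-up frequency table shaped like the one in the question.
w_df = pd.DataFrame(
    {'Frequency': [3, 5, 1, 4, 1, 5]},
    index=['the', '15', 'a', '2020', 'house', 'hotel'],
)
stop_words = {'the', 'a'}  # hypothetical stop word list

# Keep tokens that are neither stop words nor made up of digits only.
keep = [w for w in w_df.index if w not in stop_words and not w.isdigit()]

# Selecting by label keeps the frequencies alongside the words.
filtered_df = w_df.loc[keep]
print(filtered_df)
```

Tokens that mix digits and letters (for example '3rd') pass the isdigit check and would stay in the result.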

Calculate TF-IDF using sklearn for variable-n-grams in python

Problem:
Using scikit-learn to count the hits of variable-length n-grams from a particular vocabulary.
Explanation.
I got examples from here.
Imagine I have a corpus and I want to count how many hits it contains for a vocabulary like the following one:
myvocabulary = [(window=4, words=['tin', 'tan']),
                (window=3, words=['electrical', 'car']),
                (window=3, words=['elephant', 'banana'])]
What I call window here is the length of the span of words within which the words may appear, as follows:
'tin tan' is a hit (within 4 words)
'tin dog tan' is a hit (within 4 words)
'tin dog cat tan' is a hit (within 4 words)
'tin car sun eclipse tan' is NOT a hit: tin and tan are more than 4 words apart.
I just want to count how many times (window=4, words=['tin', 'tan']) appears in a text, do the same for all the other entries, and then add the results to a pandas DataFrame in order to compute TF-IDF.
I could only find something like this:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(vocabulary = myvocabulary, stop_words = 'english')
tfs = tfidf.fit_transform(corpus.values())
where vocabulary is a simple list of strings, each being a single word or several words.
Besides, this from the scikit-learn documentation:
class sklearn.feature_extraction.text.CountVectorizer
ngram_range : tuple (min_n, max_n)
The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used.
does not help either.
Any ideas?
I am not sure if this can be done using CountVectorizer or TfidfVectorizer. I have written my own function for doing this as follows:
import pandas as pd
import numpy as np
import string
def contained_within_window(token, word1, word2, threshold):
    word1 = word1.lower()
    word2 = word2.lower()
    token = token.translate(str.maketrans('', '', string.punctuation)).lower()
    if word1 in token and word2 in token:
        word_list = token.split(" ")
        word1_index = [i for i, x in enumerate(word_list) if x == word1]
        word2_index = [i for i, x in enumerate(word_list) if x == word2]
        count = 0
        for i in word1_index:
            for j in word2_index:
                if np.abs(i-j) <= threshold:
                    count = count + 1
        return count
    return 0
SAMPLE:
corpus = [
    'This is the first document. And this is what I want',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
    'I like coding in sklearn',
    'This is a very good question'
]
df = pd.DataFrame(corpus, columns=["Test"])
Your df will look like this:
                                                Test
0  This is the first document. And this is what I...
1              This document is the second document.
2                         And this is the third one.
3                        Is this the first document?
4                           I like coding in sklearn
5                       This is a very good question
Now you can apply contained_within_window as follows:
sum(df.Test.apply(lambda x: contained_within_window(x,word1="this", word2="document",threshold=2)))
And you get:
2
You can simply run a for loop to check the different vocabulary entries.
You can then use the counts to construct your pandas df and apply TF-IDF to it, which is straightforward.
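The for loop mentioned above could look roughly like the sketch below, which uses only the standard library. Note one assumption: reading the question's examples, a window of 4 appears to mean a span of four words, so 'tin dog cat tan' (three positions apart) is a hit while 'tin car sun eclipse tan' (four positions apart) is not. Under that reading the maximum index distance is window - 1. The names count_within_window, vocabulary and rows, and the three sample documents, are made up for illustration.

```python
import string

def count_within_window(text, word1, word2, max_distance):
    """Count (word1, word2) pairs whose positions differ by at most max_distance."""
    cleaned = text.translate(str.maketrans('', '', string.punctuation)).lower()
    tokens = cleaned.split()
    positions1 = [i for i, t in enumerate(tokens) if t == word1.lower()]
    positions2 = [i for i, t in enumerate(tokens) if t == word2.lower()]
    return sum(1 for i in positions1 for j in positions2 if abs(i - j) <= max_distance)

# Vocabulary entries written as dicts; "window" is the span length in words.
vocabulary = [
    {"window": 4, "words": ("tin", "tan")},
    {"window": 3, "words": ("electrical", "car")},
]
corpus = [
    "tin dog tan",
    "tin car sun eclipse tan",
    "an electrical car and a tin tan",
]

# One row per document, one column per vocabulary entry.
rows = []
for doc in corpus:
    row = []
    for entry in vocabulary:
        word1, word2 = entry["words"]
        # Assumed interpretation: a span of `window` words allows window - 1 positions apart.
        row.append(count_within_window(doc, word1, word2, entry["window"] - 1))
    rows.append(row)
print(rows)
```

The resulting count matrix can then be passed to pd.DataFrame and on to a TF-IDF step; if the window is meant as a plain index distance instead, drop the "- 1".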

Python Pandas - Lambda apply keep initial format

I would like to transform this Series
from nltk import word_tokenize, pos_tag
from nltk.corpus import stopwords
stop_words = set(stopwords.words("english"))
df = pd.Series([["comic of book", "horror of movie"], ["dark", "dark french"]])
>> 0 [comic of book, horror of movie]
>> 1 [dark, dark french]
by removing stopwords and keeping only nouns (NN in nltk). I think the apply function is the best solution, however applying it directly to these texts generates a loss of information. I get this
df.apply(lambda x: [wrd for ing in x for wrd in word_tokenize(ing) if wrd not in stop_words])
0 [comic, book, horror, movie]
1 [dark, dark, french]
instead of
0 [comic book, horror movie]
1 [dark, dark french]
I am missing something in the for loop: it splits each phrase into individual words (maybe apply is not the right tool here).
def rmsw(y):
    return ' '.join(set(y.split()) - stop_words)
pd.Series([[rmsw(y) for y in x] for x in df], df.index)
0 [comic book, horror movie]
1 [dark, dark french]
dtype: object
To maintain order and frequency
def rmsw(y):
    return ' '.join([w for w in y.split() if w not in stop_words])
pd.Series([[rmsw(y) for y in x] for x in df], df.index)
If performance is more important than elegance, a classic algorithm can do the trick.
The following code will never win a beauty contest, but it is about 350-400% faster (on my ThinkPad) than the admittedly much nicer list comprehension approach. The gap will grow with the size of your data set, as it works on a more primitive data type (lists) and only converts back to pandas at the end.
temp_list = list()
for serie in df:
    elements = list()
    for element in serie:
        for word in element.split():
            if word in stop_words:
                element = element.replace(f' {word} ', ' ')
        elements.append(element)
    temp_list.append(elements)
df = pd.Series(temp_list)
print(df)
The choice is up to you :)
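One difference between the two approaches is worth checking before choosing: replacing ' word ' (with a space on both sides) cannot match a stopword at the very start or end of a phrase, while the token-based join removes it wherever it sits. The sketch below compares them on a phrase with a leading stopword; the small stop_words set is a hypothetical stand-in for the NLTK list, and the function names are illustrative.

```python
stop_words = {'of', 'the', 'a'}  # hypothetical stand-in for the NLTK stop word list

def strip_by_tokens(phrase):
    """Drop stopwords token by token, keeping word order and the phrase intact."""
    return ' '.join(w for w in phrase.split() if w not in stop_words)

def strip_by_replace(phrase):
    """Drop stopwords by replacing ' word ' with a space, as in the loop above."""
    for word in phrase.split():
        if word in stop_words:
            phrase = phrase.replace(f' {word} ', ' ')
    return phrase

data = [['comic of book', 'horror of movie'], ['the dark', 'dark french']]
by_tokens = [[strip_by_tokens(p) for p in row] for row in data]
by_replace = [[strip_by_replace(p) for p in row] for row in data]
print(by_tokens)   # the leading 'the' is removed
print(by_replace)  # the leading 'the' survives
```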
