Word count frequency: removing stopwords

Word count frequency: removing stopwords - python

I have the following list of words frequency generated by the code below.
Frequency
the 3
15 5
18 1
a 1
2020 4
... ...
house 1
apartment 1
hotel 5
pool 1
swimming 1
The code is
from sklearn.feature_extraction.text import CountVectorizer
word_vectorizer = CountVectorizer(ngram_range=(1,1), analyzer='word')
sparse_matrix = word_vectorizer.fit_transform(df['Sentences'])
w_freq = sum(sparse_matrix).toarray()[0]
w_df=pd.DataFrame(w_freq, index=word_vectorizer.get_feature_names(), columns=['Frequency'])
w_df
I would like to remove the stopwords from the the list of words above (not in the column of my dataframe, but just in the output, creating a new variable in case it would be needed).
I have tried with w_df =[w for w in w_df if not w in stop_words] but it gave me ['Frequency'] as output.
I think this happens because it is not a list.
Could you please tell me how to remove stopwords (numbers included) from there?
Thanks

CountVectorizer has a parameter that does that for you. You can feed it a custom list of stopwords, or set it to english, a built-in stop word list. Here's an example:
s = pd.Series('Just a random sentence with more than one stopword')
word_vectorizer = CountVectorizer(ngram_range=(1,1),
analyzer='word',
stop_words='english')
sparse_matrix = word_vectorizer.fit_transform(s)
w_freq = sum(sparse_matrix).toarray()[0]
w_df=pd.DataFrame(w_freq,
index=word_vectorizer.get_feature_names(),
columns=['Frequency'])
print(w_df)
Frequency
just 1
random 1
sentence 1
stopword 1

Just to add, your approach wasn't all that wrong. You needed just a minor change.
w_df = [w for w in w_df.index if not w in stop_words]
Your problem was simply that, in the list comprehension, you iterated over the dataframe itself rather than the tokens which are in its index. This would also return the desired result.

Related

Text Preprocessing for NLP but from List of Dictionaries

I'm attempting to do an NLP project with a goodreads data set. my data set is a list of dictionaries. Each dictionary looks like so (the list is called 'reviews'):
>>> reviews[0]
{'user_id': '8842281e1d1347389f2ab93d60773d4d',
'book_id': '23310161',
'review_id': 'f4b4b050f4be00e9283c92a814af2670',
'rating': 4,
'review_text': 'Fun sequel to the original.',
'date_added': 'Tue Nov 17 11:37:35 -0800 2015',
'date_updated': 'Tue Nov 17 11:38:05 -0800 2015',
'read_at': '',
'started_at': '',
'n_votes': 7,
'n_comments': 0}
There are 700k+ of these dictionaries in my dataset.
First question: I am only interested in the elements 'rating' and 'review_text'. I know I can delete elements from each dictionary, but how do I do it for all of the dictionaries?
Second question: I am able to do sentence and word tokenization of an individual dictionary in the list by specifying the dictionary in the list, then the element 'review_text' within the dictionary like so:
paragraph = reviews[0]['review_text']
And then applying sent_tokenize and word_tokenize like so:
print(sent_tokenize(paragraph))
print(word_tokenize(paragraph))
But how do I apply these methods to the entire data set? I am stuck here, and cannot even attempt to do any of the text preprocessing (lower casing, removing punctuation, lemmatizing, etc).
TIA

To answer the first question, you can simply put them into dataframe with only your interesting columns (i.e. rating and review_text). This is to avoid looping and managing them record by record and is also easy to be manipulated on the further processes.
After you came up with the dataframe, use apply to preprocess (e.g. lower, tokenize, remove punctuation, lemmatize, and stem) your text column and generate new column named tokens that store the preprocessed text (i.e. tokens). This is to satisfy the second question.
from nltk import sent_tokenize, word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer
import string
punc_list = list(string.punctuation)
porter = PorterStemmer()
lemmatizer = WordNetLemmatizer()
def text_processing(row):
all_words = list()
# sentence tokenize
for sent in sent_tokenize(row['review_text']):
# lower words and tokenize
words = word_tokenize(sent.lower())
# lemmatize
words_lem = [lemmatizer.lemmatize(w) for w in words]
# remove punctuation
used_words = [w for w in words_lem if w not in punc_list]
# stem
words_stem = [porter.stem(w) for w in used_words]
all_words += words_stem
return all_words
# create dataframe from list of dicts (select only interesting columns)
df = pd.DataFrame(reviews, columns=['user_id', 'rating', 'review_text'])
df['tokens'] = df.apply(lambda x: text_processing(x), axis=1)
print(df.head())
example of output:
user_id rating review_text tokens
0 1 4 Fun sequel to the [fun, sequel, to, the]
1 2 2 It was a slippery slope [it, wa, a, slipperi, slope]
2 3 3 The trick to getting [the, trick, to, get]
3 4 3 The bird had a [the, bird, had, a]
4 5 5 That dog likes cats [that, dog, like, cat]
Finally, if you don’t prefer dataframe, you can export it as other formats such as csv (to_csv), json (to_json), and list of dicts (to_dict('records')).
Hope this would help

PYTHON: Extract Non-English words and iterate it over a dataframe

I have a table of about 30,000 rows and need to extract non-English words from a column named dummy_df from a dummy_df dataframe. I need to put the non-english words in an adjacent column named non_english.
A dummy data is as thus:
dummy_df = pandas.DataFrame({'outcome': ["I want to go to church", "I love Matauranga", "Take me to Oranga Tamariki"]})
My idea is to extract non-English words from a sentence, and then iterate the process over a dataframe. I was able to accurately extract non-English words from a sentence with this code:
import nltk
nltk.download('words')
from nltk.corpus import words
words = set(nltk.corpus.words.words())
sent = "I love Matauranga"
" ".join(w for w in nltk.wordpunct_tokenize(sent) \
if not w.lower() in words or not w.isalpha())
The result of the above code is 'Matauranga' which is perfectly correct.
But when I try to iterate the code over a dataframe using this code:
import nltk
nltk.download('words')
from nltk.corpus import words
def no_english(text):
words = set(nltk.corpus.words.words())
" ".join(w for w in nltk.wordpunct_tokenize(text['outcome']) \
if not w.lower() in words or not w.isalpha())
dummy_df['non_english'] = dummy_df.apply(no_english, axis = 1)
print(dummy_df)
I got an undesirable result in that the non_english column has none value instead of the desired non-english words (see below):
outcome non_english
0 I want to go to church None
1 I love Matauranga None
2 Take me to Oranga Tamariki None
3 None
Instead, the desired result should be:
outcome non_english
0 I want to go to church
1 I love Matauranga Matauranga
2 Take me to Oranga Tamariki Oranga Tamariki

You are missing the return in your function:
import nltk
nltk.download('words')
from nltk.corpus import words
def no_english(text):
words = set(nltk.corpus.words.words())
return " ".join(w for w in nltk.wordpunct_tokenize(text['outcome']) \
if not w.lower() in words or not w.isalpha())
dummy_df['non_english'] = dummy_df.apply(no_english, axis = 1)
print(dummy_df)
output:
outcome non_english
0 I want to go to church
1 I love Matauranga Matauranga
2 Take me to Oranga Tamariki Oranga Tamariki

Pandas quickest way to count concurrences of strings in rows (words in sentences)

Imagine that a column of a dataframe contains sentences (words separated by comma) and then I have combination of words (also words separated by comma). Words are never repeated nor in the sentences nor in the combinations. I need to count the number of sentences where each combination of word happen, order independent.
I have a column of pandas dataframe, that looks like this:
df.sentences =
0 GO:0002576,GO:0008150,GO:0043312
1 GO:0001869,GO:0002576,GO:0007597,GO:0010466,GO...
2 GO:0006400,GO:0006412,GO:0006418,GO:0006419,GO...
3 GO:0007416,GO:0030036,GO:0030097,GO:0032092,GO...
4 GO:0002407,GO:0006816,GO:0006874,GO:0006887,GO...
...
14503 GO:0002221,GO:0002223,GO:0002376,GO:0045087
14504 GO:0003351,GO:0048240
14505 GO:0001889,GO:0006351,GO:0006355,GO:0006357,GO...
14506 GO:0006892,GO:0007596,GO:0008089,GO:0016081,GO...
14507 GO:0000209,GO:0007030,GO:0008283,GO:0016567,GO...
Name: annots, Length: 14508, dtype: object
and then a list of different combination of strings derived from the above column:
combinations =
0 GO:0007165
1 GO:0007186
2 GO:0007155
3 GO:0006954
4 GO:0019221
...
16778101 GO:0000165,GO:0000209,GO:0002223,GO:0006521,GO...
16778102 GO:0000165,GO:0000209,GO:0002223,GO:0002479,GO...
16778103 GO:0000165,GO:0000209,GO:0002223,GO:0002479,GO...
16778104 GO:0000165,GO:0000209,GO:0002223,GO:0002479,GO...
16778105 GO:0000165,GO:0000209,GO:0002223,GO:0002479,GO...
Name: itemsets, Length: 16778106, dtype: object
What I have by now:
setsentences = [set(sentece.split(',')) for sentence in df.sentences]
combinations = [set(comb.split(',')) for comb in combinations]
sentenceCount = {}
for comb in combinations:
sentenceCount[','.join(comb)] = sum([comb.issubset(sentence) for sentence in setsentences])
The problem here (IMO) is a loop of 16778105 iterations... Is there a way to use apply or map over the pandas DF or in the comb to count the sentences quickier ? Maybe tranforming the words to numbers ? Using regex ?
I hope I have explained myself sufficiently. Thanks for your time in advanced.
In order to create an example for testing:
import random,string
chars = string.ascii_uppercase + string.ascii_lowercase + string.digits
nsentences = 18000
sentences = [','.join(random.sample(chars, random.randint(1,len(chars)))) for sentence in range(nsentences)]
ncombinations = 1000000
combinations = [','.join(random.sample(chars, random.randint(1,len(chars)))) for sentence in range(ncombinations)]
setsentences = [set(sentence.split(',')) for sentence in sentences]
combinations = [set(comb.split(',')) for comb in combinations]
sentenceCount = {}
for comb in combinations:
sentenceCount[','.join(comb)] = sum([comb.issubset(sentence) for sentence in setsentences])

TFIDF separate for each label

Using TFIDFvectorizor(SKlearn), how to obtain word ranking based on tfidf score for each label separately. I want the word frequency for each label (positive and negative).
relevant code:
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,stop_words='english',use_idf=True, ngram_range =(1,1))
features_train = vectorizer.fit_transform(features_train).todense()
features_test = vectorizer.transform(features_test).todense()
for i in range(len(features_test)):
first_document_vector=features_test[i]
df_t = pd.DataFrame(first_document_vector.T, index=feature_names, columns=["tfidf"])
df_t.sort_values(by=["tfidf"],ascending=False).head(50)

This will give you positive, neutral, and negative sentiment analysis for each row of comments in a field of a dataframe. There is a lot of preprocessing code, to get things cleaned up, filter out stop-words, do some basic charting, etc.
import pickle
import pandas as pd
import numpy as np
import pandas as pd
import re
import nltk
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
global str
df = pd.read_csv('C:\\your_path\\test_dataset.csv')
print(df.shape)
# let's experiment with some sentiment analysis concepts
# first we need to clean up the stuff in the independent field of the DF we are workign with
df['body'] = df[['body']].astype(str)
df['review_text'] = df[['review_text']].astype(str)
df['body'] = df['body'].str.replace('\d+', '')
df['review_text'] = df['review_text'].str.replace('\d+', '')
# get rid of special characters
df['body'] = df['body'].str.replace(r'[^\w\s]+', '')
df['review_text'] = df['review_text'].str.replace(r'[^\w\s]+', '')
# get rid fo double spaces
df['body'] = df['body'].str.replace(r'\^[a-zA-Z]\s+', '')
df['review_text'] = df['review_text'].str.replace(r'\^[a-zA-Z]\s+', '')
# convert all case to lower
df['body'] = df['body'].str.lower()
df['review_text'] = df['review_text'].str.lower()
# It looks like the language in body and review_text is very similar (2 fields in dataframe). let's check how closely they match...
# seems like the tone is similar, but the text is not matching at a high rate...less than 20% match rate
import difflib
body_list = df['body'].tolist()
review_text_list = df['review_text'].tolist()
body = body_list
reviews = review_text_list
s = difflib.SequenceMatcher(None, body, reviews).ratio()
print ("ratio:", s, "\n")
# filter out stop words
# these are the most common words such as: “the“, “a“, and “is“.
from nltk.corpus import stopwords
english_stopwords = stopwords.words('english')
print(len(english_stopwords))
text = str(body_list)
# split into words
from nltk.tokenize import word_tokenize
tokens = word_tokenize(text)
# convert to lower case
tokens = [w.lower() for w in tokens]
# remove punctuation from each word
import string
table = str.maketrans('', '', string.punctuation)
stripped = [w.translate(table) for w in tokens]
# remove remaining tokens that are not alphabetic
words = [word for word in stripped if word.isalpha()]
# filter out stop words
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
words = [w for w in words if not w in stop_words]
print(words[:100])
# plot most frequently occurring words in a bar chart
# remove unwanted characters, numbers and symbols
df['review_text'] = df['review_text'].str.replace("[^a-zA-Z#]", " ")
#Let’s try to remove the stopwords and short words (<2 letters) from the reviews.
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
# function to remove stopwords
def remove_stopwords(rev):
rev_new = " ".join([i for i in rev if i not in stop_words])
return rev_new
# remove short words (length < 3)
df['review_text'] = df['review_text'].apply(lambda x: ' '.join([w for w in x.split() if len(w)>2]))
# remove stopwords from the text
reviews = [remove_stopwords(r.split()) for r in df['review_text']]
# make entire text lowercase
reviews = [r.lower() for r in reviews]
#Let’s again plot the most frequent words and see if the more significant words have come out.
freq_words(reviews, 35)
###############################################################################
###############################################################################
# Tf-idf is a very common technique for determining roughly what each document in a set of
# documents is “about”. It cleverly accomplishes this by looking at two simple metrics: tf
# (term frequency) and idf (inverse document frequency). Term frequency is the proportion
# of occurrences of a specific term to total number of terms in a document. Inverse document
# frequency is the inverse of the proportion of documents that contain that word/phrase.
# Simple, right!? The general idea is that if a specific phrase appears a lot of times in a
# given document, but it doesn’t appear in many other documents, then we have a good idea
# that the phrase is important in distinguishing that document from all the others.
# Starting with the CountVectorizer/TfidfTransformer approach...
# convert fields in datframe to list
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
cvec = CountVectorizer(stop_words='english', min_df=1, max_df=.5, ngram_range=(1,2))
cvec
# Calculate all the n-grams found in all documents
from itertools import islice
cvec.fit(body_list)
list(islice(cvec.vocabulary_.items(), 20))
len(cvec.vocabulary_)
# Let’s take a moment to describe these parameters as they are the primary levers for adjusting what
# feature set we end up with. First is “min_df” or mimimum document frequency. This sets the minimum
# number of documents that any term is contained in. This can either be an integer which sets the
# number specifically, or a decimal between 0 and 1 which is interpreted as a percentage of all documents.
# Next is “max_df” which similarly controls the maximum number of documents any term can be found in.
# If 90% of documents contain the word “spork” then it’s so common that it’s not very useful.
# Initialize the vectorizer with new settings and check the new vocabulary length
cvec = CountVectorizer(stop_words='english', min_df=.0025, max_df=.5, ngram_range=(1,2))
cvec.fit(body_list)
len(cvec.vocabulary_)
# Our next move is to transform the document into a “bag of words” representation which essentially is
# just a separate column for each term containing the count within each document. After that, we’ll
# take a look at the sparsity of this representation which lets us know how many nonzero values there
# are in the dataset. The more sparse the data is the more challenging it will be to model
cvec_counts = cvec.transform(body_list)
print('sparse matrix shape:', cvec_counts.shape)
print('nonzero count:', cvec_counts.nnz)
print('sparsity: %.2f%%' % (100.0 * cvec_counts.nnz / (cvec_counts.shape[0] * cvec_counts.shape[1])))
# get counts of frequently occurring terms; top 20
occ = np.asarray(cvec_counts.sum(axis=0)).ravel().tolist()
counts_df = pd.DataFrame({'term': cvec.get_feature_names(), 'occurrences': occ})
counts_df.sort_values(by='occurrences', ascending=False).head(20)
# Now that we’ve got term counts for each document we can use the TfidfTransformer to calculate the
# weights for each term in each document
transformer = TfidfTransformer()
transformed_weights = transformer.fit_transform(cvec_counts)
transformed_weights
# we can take a look at the top 20 terms by average tf-idf weight.
weights = np.asarray(transformed_weights.mean(axis=0)).ravel().tolist()
weights_df = pd.DataFrame({'term': cvec.get_feature_names(), 'weight': weights})
weights_df.sort_values(by='weight', ascending=False).head(20)
# FINALLY!!!!
# Here we are doing some sentiment analysis, and distilling the 'review_text' field into positive, neutral, or negative,
# based on the tone of the text in each record. Also, we are filtering out the records that have <.2 negative score;
# keeping only those that have >.2 negative score. This is interesting, but this can contain some non-intitive results.
# For instance, one record in 'review_text' literally says 'no issues'. This is probably positive, but the algo sees the
# word 'no' and interprets the comment as negative. I would argue that it's positive. We'll circle back and resolve
# this potential issue a little later.
import nltk
nltk.download('vader_lexicon')
nltk.download('punkt')
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()
df['sentiment'] = df['review_text'].apply(lambda x: sid.polarity_scores(x))
def convert(x):
if x < 0:
return "negative"
elif x > .2:
return "positive"
else:
return "neutral"
df['result'] = df['sentiment'].apply(lambda x:convert(x['compound']))
# df.groupby(['brand','result']).size()
# df.groupby(['brand','result']).count()
x = df.groupby(['review_text','brand'])['result'].value_counts(normalize=True)
x = df.groupby(['brand'])['result'].value_counts(normalize=True)
y = x.loc[(x.index.get_level_values(1) == 'negative')]
print(y[y>0.2])
Result:
brand result
ABH negative 0.500000
Alexander McQueen negative 0.500000
Anastasia negative 0.498008
BURBERRY negative 0.248092
Beats negative 0.272947
Bowers & Wilkins negative 0.500000
Breitling Official negative 0.666667
Capri Blue negative 0.333333
FERRARI negative 1.000000
Fendi negative 0.283582
GIORGIO ARMANI negative 1.000000
Jan Marini Skin Research negative 0.250000
Jaybird negative 0.235294
LANCï¿½ME negative 0.500000
Longchamp negative 0.271605
Longchamps negative 0.500000
M.A.C negative 0.203390
Meaningful Beauty negative 0.222222
Polk Audio negative 0.256410
Pumas negative 0.222222
Ralph Lauren Polo negative 0.500000
Roberto Cavalli negative 0.250000
Samsung negative 0.332298
T3 Micro negative 0.224138
Too Faced negative 0.216216
VALENTINO by Mario Valentino negative 0.333333
YSL negative 0.250000
Feel free to skip things you find to be irrelevant, but as-is, the code does a fairly comprehensive NLP analysis.
Also, take a look at these two links.
https://www.analyticsvidhya.com/blog/2018/02/the-different-methods-deal-text-data-predictive-python/
https://towardsdatascience.com/fine-grained-sentiment-analysis-in-python-part-1-2697bb111ed4

Calculate TF-IDF using sklearn for variable-n-grams in python

Problem:
using scikit-learn to find the number of hits of variable n-grams of a particular vocabulary.
Explanation.
I got examples from here.
Imagine I have a corpus and I want to find how many hits (counting) has a vocabulary like the following one:
myvocabulary = [(window=4, words=['tin', 'tan']),
(window=3, words=['electrical', 'car'])
(window=3, words=['elephant','banana'])
What I call here window is the length of the span of words in which the words can appear. as follows:
'tin tan' is hit (within 4 words)
'tin dog tan' is hit (within 4 words)
'tin dog cat tan is hit (within 4 words)
'tin car sun eclipse tan' is NOT hit. tin and tan appear more than 4 words away from each other.
I just want to count how many times (window=4, words=['tin', 'tan']) appears in a text and the same for all the other ones and then add the result to a pandas in order to calculate a tf-idf algorithm.
I could only find something like this:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(vocabulary = myvocabulary, stop_words = 'english')
tfs = tfidf.fit_transform(corpus.values())
where vocabulary is a simple list of strings, being single words or several words.
besides from scikitlearn:
class sklearn.feature_extraction.text.CountVectorizer
ngram_range : tuple (min_n, max_n)
The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used.
does not help neither.
Any ideas?

I am not sure if this can be done using CountVectorizer or TfidfVectorizer. I have written my own function for doing this as follows:
import pandas as pd
import numpy as np
import string
def contained_within_window(token, word1, word2, threshold):
word1 = word1.lower()
word2 = word2.lower()
token = token.translate(str.maketrans('', '', string.punctuation)).lower()
if (word1 in token) and word2 in (token):
word_list = token.split(" ")
word1_index = [i for i, x in enumerate(word_list) if x == word1]
word2_index = [i for i, x in enumerate(word_list) if x == word2]
count = 0
for i in word1_index:
for j in word2_index:
if np.abs(i-j) <= threshold:
count=count+1
return count
return 0
SAMPLE:
corpus = [
'This is the first document. And this is what I want',
'This document is the second document.',
'And this is the third one.',
'Is this the first document?',
'I like coding in sklearn',
'This is a very good question'
]
df = pd.DataFrame(corpus, columns=["Test"])
your df will look like this:
Test
0 This is the first document. And this is what I...
1 This document is the second document.
2 And this is the third one.
3 Is this the first document?
4 I like coding in sklearn
5 This is a very good question
Now you can apply contained_within_window as follows:
sum(df.Test.apply(lambda x: contained_within_window(x,word1="this", word2="document",threshold=2)))
And you get:
2
You can just run a for loop for checking different instances.
And you this to construct your pandas df and apply TfIdf on it, which is straight forward.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Word count frequency: removing stopwords - python

Related

Text Preprocessing for NLP but from List of Dictionaries

PYTHON: Extract Non-English words and iterate it over a dataframe

Pandas quickest way to count concurrences of strings in rows (words in sentences)

TFIDF separate for each label

Calculate TF-IDF using sklearn for variable-n-grams in python

Categories

Resources