Text Preprocessing for NLP but from List of Dictionaries - python

I'm attempting to do an NLP project with a Goodreads data set. My data set is a list of dictionaries. Each dictionary looks like this (the list is called 'reviews'):
>>> reviews[0]
{'user_id': '8842281e1d1347389f2ab93d60773d4d',
'book_id': '23310161',
'review_id': 'f4b4b050f4be00e9283c92a814af2670',
'rating': 4,
'review_text': 'Fun sequel to the original.',
'date_added': 'Tue Nov 17 11:37:35 -0800 2015',
'date_updated': 'Tue Nov 17 11:38:05 -0800 2015',
'read_at': '',
'started_at': '',
'n_votes': 7,
'n_comments': 0}
There are 700k+ of these dictionaries in my dataset.
First question: I am only interested in the elements 'rating' and 'review_text'. I know I can delete elements from each dictionary, but how do I do it for all of the dictionaries?
Second question: I am able to do sentence and word tokenization of an individual review by specifying the dictionary in the list and then the element 'review_text' within it, like so:
paragraph = reviews[0]['review_text']
And then applying sent_tokenize and word_tokenize like so:
print(sent_tokenize(paragraph))
print(word_tokenize(paragraph))
But how do I apply these methods to the entire data set? I am stuck here, and cannot even attempt to do any of the text preprocessing (lower casing, removing punctuation, lemmatizing, etc).
TIA

To answer the first question, you can simply put them into a DataFrame with only the columns you are interested in (i.e. rating and review_text). This avoids looping over the records one by one and makes the data easy to manipulate in the further processing steps.
Once you have the DataFrame, use apply to preprocess (e.g. lowercase, tokenize, remove punctuation, lemmatize, and stem) your text column and generate a new column named tokens that stores the preprocessed text (i.e. the tokens). This answers the second question.
import pandas as pd
from nltk import sent_tokenize, word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer
import string

punc_list = list(string.punctuation)
porter = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def text_processing(row):
    all_words = list()
    # sentence tokenize
    for sent in sent_tokenize(row['review_text']):
        # lower words and tokenize
        words = word_tokenize(sent.lower())
        # lemmatize
        words_lem = [lemmatizer.lemmatize(w) for w in words]
        # remove punctuation
        used_words = [w for w in words_lem if w not in punc_list]
        # stem
        words_stem = [porter.stem(w) for w in used_words]
        all_words += words_stem
    return all_words

# create dataframe from list of dicts (select only interesting columns)
df = pd.DataFrame(reviews, columns=['user_id', 'rating', 'review_text'])
df['tokens'] = df.apply(lambda x: text_processing(x), axis=1)
print(df.head())
example of output:
user_id rating review_text tokens
0 1 4 Fun sequel to the [fun, sequel, to, the]
1 2 2 It was a slippery slope [it, wa, a, slipperi, slope]
2 3 3 The trick to getting [the, trick, to, get]
3 4 3 The bird had a [the, bird, had, a]
4 5 5 That dog likes cats [that, dog, like, cat]
Finally, if you prefer not to keep a DataFrame, you can export it to other formats such as CSV (to_csv), JSON (to_json), or a list of dicts (to_dict('records')).
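For example, a minimal sketch of those exports (the file names here are just placeholders):
# keep only the columns of interest and write them out
df[['rating', 'tokens']].to_csv('reviews_tokens.csv', index=False)          # CSV
df[['rating', 'tokens']].to_json('reviews_tokens.json', orient='records')   # JSON
records = df[['rating', 'tokens']].to_dict('records')                       # back to a list of dicts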
Hope this helps.

Related

Word count frequency: removing stopwords

I have the following list of word frequencies generated by the code below.
Frequency
the 3
15 5
18 1
a 1
2020 4
... ...
house 1
apartment 1
hotel 5
pool 1
swimming 1
The code is
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

word_vectorizer = CountVectorizer(ngram_range=(1,1), analyzer='word')
sparse_matrix = word_vectorizer.fit_transform(df['Sentences'])
w_freq = sum(sparse_matrix).toarray()[0]
w_df = pd.DataFrame(w_freq, index=word_vectorizer.get_feature_names(), columns=['Frequency'])
w_df
I would like to remove the stopwords from the list of words above (not in the column of my dataframe, but just in this output, creating a new variable in case it is needed).
I have tried w_df = [w for w in w_df if not w in stop_words] but it gave me ['Frequency'] as output.
I think this happens because it is not a list.
Could you please tell me how to remove stopwords (numbers included) from there?
Thanks
CountVectorizer has a parameter that does that for you. You can feed it a custom list of stopwords, or set it to 'english', a built-in stop word list. Here's an example:
s = pd.Series('Just a random sentence with more than one stopword')

word_vectorizer = CountVectorizer(ngram_range=(1,1),
                                  analyzer='word',
                                  stop_words='english')
sparse_matrix = word_vectorizer.fit_transform(s)
w_freq = sum(sparse_matrix).toarray()[0]
w_df = pd.DataFrame(w_freq,
                    index=word_vectorizer.get_feature_names(),
                    columns=['Frequency'])
print(w_df)
Frequency
just 1
random 1
sentence 1
stopword 1
Just to add, your approach wasn't all that wrong. You needed just a minor change.
w_df = [w for w in w_df.index if not w in stop_words]
Your problem was simply that, in the list comprehension, you iterated over the dataframe itself rather than the tokens which are in its index. This would also return the desired result.
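If you would rather keep the frequency table as a DataFrame instead of a flat list, a small sketch along the same lines (assuming stop_words is the stopword set you already have, e.g. from nltk.corpus.stopwords) could filter on the index directly:
# drop stopword rows and purely numeric rows, keeping the Frequency column
mask = ~w_df.index.isin(stop_words) & ~w_df.index.str.isnumeric()
w_df_clean = w_df[mask]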

why does if not in(x,y) not work at all in python

I want to select words only if the word in each row of my column is not in the stop words and not in string punctuation.
This is my data after tokenizing and removing the stopwords. I also want to remove the punctuation at the same time I remove the stopwords; see row two, where there is a comma after 'usf'. I thought of if word not in (stopwords,string.punctuation), since that should mean not in stopwords and not in string.punctuation (I saw it here), but it fails to remove the stopwords and the punctuation. How do I fix this?
data['text'].head(5)
Out[38]:
0 ['ve, searching, right, words, thank, breather...
1 [free, entry, 2, wkly, comp, win, fa, cup, fin...
2 [nah, n't, think, goes, usf, ,, lives, around,...
3 [even, brother, like, speak, ., treat, like, a...
4 [date, sunday, !, !]
Name: text, dtype: object
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string

data = pd.read_csv(r"D:/python projects/read_files/SMSSpamCollection.tsv",
                   sep='\t', header=None)
data.columns = ['label', 'text']
stopwords = set(stopwords.words('english'))

def process(df):
    data = word_tokenize(df.lower())
    data = [word for word in data if word not in (stopwords,string.punctuation)]
    return data

data['text'] = data['text'].apply(process)
If you still want to do it in one if statement, you could convert string.punctuation to a set and combine it with stopwords with an OR operation. This is what it would look like:
data = [word for word in data if word not in (stopwords|set(string.punctuation))]
then you need to change
data = [word for word in data if word not in (stopwords,string.punctuation)]
to
data = [word for word in data if word not in stopwords and word not in string.punctuation]
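Plugged back into the question's process function, that fix would look roughly like this:
def process(df):
    data = word_tokenize(df.lower())
    # keep a token only if it is neither a stopword nor a punctuation character
    data = [word for word in data
            if word not in stopwords and word not in string.punctuation]
    return data

data['text'] = data['text'].apply(process)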
In the function process you must convert the string (string.punctuation) to a pandas.core.series.Series and combine it with the stopwords using
concat.
The function will be:
def process(df):
    data = word_tokenize(df.lower())
    # build one Series holding both the stopwords and the punctuation characters
    blacklist = pd.concat([pd.Series(list(stopwords)),
                           pd.Series(list(string.punctuation))])
    data = [word for word in data if word not in blacklist.values]
    return data

Calculate TF-IDF using sklearn for variable-n-grams in python

Problem:
Using scikit-learn to find the number of hits (counts) of variable n-grams of a particular vocabulary.
Explanation:
I got examples from here.
Imagine I have a corpus and I want to find how many hits (counting) has a vocabulary like the following one:
myvocabulary = [(window=4, words=['tin', 'tan']),
                (window=3, words=['electrical', 'car']),
                (window=3, words=['elephant', 'banana'])]
What I call window here is the length of the span of words within which the words can appear, as follows:
'tin tan' is a hit (within 4 words)
'tin dog tan' is a hit (within 4 words)
'tin dog cat tan' is a hit (within 4 words)
'tin car sun eclipse tan' is NOT a hit: tin and tan appear more than 4 words away from each other.
I just want to count how many times (window=4, words=['tin', 'tan']) appears in a text, and the same for all the other ones, and then add the result to a pandas DataFrame in order to calculate a tf-idf algorithm.
I could only find something like this:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(vocabulary = myvocabulary, stop_words = 'english')
tfs = tfidf.fit_transform(corpus.values())
where vocabulary is a simple list of strings, being single words or several words.
Besides, from scikit-learn:
class sklearn.feature_extraction.text.CountVectorizer
ngram_range : tuple (min_n, max_n)
The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used.
does not help either.
Any ideas?
I am not sure if this can be done using CountVectorizer or TfidfVectorizer. I have written my own function for doing this as follows:
import pandas as pd
import numpy as np
import string

def contained_within_window(token, word1, word2, threshold):
    word1 = word1.lower()
    word2 = word2.lower()
    # strip punctuation and lowercase the whole sentence
    token = token.translate(str.maketrans('', '', string.punctuation)).lower()
    if (word1 in token) and (word2 in token):
        word_list = token.split(" ")
        word1_index = [i for i, x in enumerate(word_list) if x == word1]
        word2_index = [i for i, x in enumerate(word_list) if x == word2]
        count = 0
        for i in word1_index:
            for j in word2_index:
                if np.abs(i - j) <= threshold:
                    count = count + 1
        return count
    return 0
SAMPLE:
corpus = [
    'This is the first document. And this is what I want',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
    'I like coding in sklearn',
    'This is a very good question'
]

df = pd.DataFrame(corpus, columns=["Test"])
your df will look like this:
Test
0 This is the first document. And this is what I...
1 This document is the second document.
2 And this is the third one.
3 Is this the first document?
4 I like coding in sklearn
5 This is a very good question
Now you can apply contained_within_window as follows:
sum(df.Test.apply(lambda x: contained_within_window(x,word1="this", word2="document",threshold=2)))
And you get:
2
You can just run a for loop to check the different instances, and use this to construct your pandas df and apply TfIdf on it, which is straightforward.
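For instance, a minimal sketch of that loop, assuming a vocabulary in the question's (window, words) form (the names below are just illustrative):
# hypothetical vocabulary: (window, [word1, word2]) pairs
myvocabulary = [(4, ['tin', 'tan']),
                (3, ['electrical', 'car']),
                (3, ['elephant', 'banana'])]

counts = {}
for window, (w1, w2) in myvocabulary:
    # total number of hits for this pair over the whole corpus
    counts[w1 + ' ' + w2] = sum(df.Test.apply(
        lambda x: contained_within_window(x, word1=w1, word2=w2, threshold=window)))

count_df = pd.Series(counts)  # one count per vocabulary entry
print(count_df)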

How do I singularize and lemmatize an entire pandas dataframe column using TextBlob?

I have a pandas dataframe that contains the following columns: df['adjectives'], df['nouns'], and df['adverbs']. Each of these columns contains lists of tokens based on their respective parts of speech.
I would like to use TextBlob to create three new columns in my data frame, df['adjlemmatized'], df['nounlemmatized'], and df['advlemmatized'].
Each of these columns should contain wordlists consisting of words in their singularized, lemma form.
I have tried following the TextBlob documentation, but I am stuck writing functions that will iterate over my entire dataframe.
Words Inflection and Lemmatization
Each word in TextBlob.words or Sentence.words is a Word object (a subclass of unicode) with useful methods, e.g. for word inflection.
>>> sentence = TextBlob('Use 4 spaces per indentation level.')
>>> sentence.words
WordList(['Use', '4', 'spaces', 'per', 'indentation', 'level'])
>>> sentence.words[2].singularize()
'space'
>>> sentence.words[-1].pluralize()
'levels'
Words can be lemmatized by calling the lemmatize method.
>>> from textblob import Word
>>> w = Word("octopi")
>>> w.lemmatize()
'octopus'
>>> w = Word("went")
>>> w.lemmatize("v") # Pass in WordNet part of speech (verb)
'go'
Here is the code I used to get the parts of speech from my text:
# get adjectives
def get_adjectives(text):
    blob = TextBlob(text)
    print(text)
    return [word for (word, tag) in blob.tags if tag.startswith("JJ")]

df['adjectives'] = df['clean_reviews'].apply(get_adjectives)
If your words are already tokenized and you want to keep them that way, it's easy:
from textblob import Word

df['adjlemmatized'] = df.adjectives.apply(lambda x: [Word(w) for w in x])
df['adjlemmatized'] = df.adjlemmatized.apply(lambda x: [w.lemmatize() for w in x])
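To cover all three columns and the singularization the question asks about, a sketch along the same lines (assuming the nouns and adverbs columns hold token lists just like adjectives, and using the Word class imported above):
for col, new_col in [('adjectives', 'adjlemmatized'),
                     ('nouns', 'nounlemmatized'),
                     ('adverbs', 'advlemmatized')]:
    # singularize each token, then lemmatize it
    df[new_col] = df[col].apply(
        lambda words: [Word(Word(w).singularize()).lemmatize() for w in words])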

python word grouping based on words before and after

I am trying to create groups of words. First I count all the words. Then I establish the top 10 words by word count. Then I want to create 10 groups of words based on those top 10. Each group consists of all the words that appear before and after the top word.
I have survey results stored in a python pandas dataframe structured like this
Question_ID | Customer_ID | Answer
1 234 Data is very important to use because ...
2 234 We value data since we need it ...
I also saved the answers column as a string.
I am using the following code to find 3 words before and after a word (I actually had to create a string out of the answers column):
answers_str = df.Answer.apply(str)
for value in answers_str:
    non_data = re.split('data|Data', value)
    terms_list = [term for term in non_data if len(term) > 0]  # skip empty terms
    substrs = [term.split()[0:3] for term in terms_list]       # slice and grab first three terms
    result = [' '.join(term) for term in substrs]              # combine the terms back into substrings
    print(result)
I have been manually creating groups of words - but is there a way of doing it in python?
So based on the example shown above the group with word counts would look like this:
group "data":
data : 2
important: 1
value: 1
need:1
then when it goes through the whole file, there would be another group:
group "analytics:
analyze: 5
report: 7
list: 10
visualize: 16
The idea would be to get rid of "we", "to", "is" as well - but I can do that manually, if that's not possible.
Then I want to establish the 10 most used words (by word count) and create 10 groups with the words that appear in front of and behind those top 10 words.
We can use regex for this. We'll be using this regular expression
((?:\b\w+?\b\s*){0,3})[dD]ata((?:\s*\b\w+?\b){0,3})
which you can test for yourself here, to extract the three words before and after each occurrence of data.
First, let's remove all the words we don't like from the strings.
import re

# If you're processing a lot of sentences, it's probably wise to preprocess
# the pattern, assuming that bad_words is the same for all sentences
def remove_words(sentence, bad_words):
    pat = r'(?:{})'.format(r'|'.join(bad_words))
    return re.sub(pat, '', sentence, flags=re.IGNORECASE)
Then we want to get the words that surround data in each line:
data_pat = r'((?:\b\w+?\b\s*){0,3})[dD]ata((?:\s*\b\w+?\b){0,3})'
res = re.findall(data_pat, s, flags=re.IGNORECASE)
gives us a list of tuples of strings. We want to get a list of those strings after they are split.
from itertools import chain
list_of_words = list(chain.from_iterable(map(str.split, chain.from_iterable(map(chain, chain(res))))))
That's not pretty, but it works. Basically, we pull the tuples out of the list, pull the strings out of each tuple, split each string, and then pull all the resulting words into one big list.
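If the chain one-liner is hard to read, an equivalent spelled-out version (same idea, assuming res is the list of (before, after) tuples from re.findall) would be:
list_of_words = []
for before, after in res:          # each match is a (before, after) tuple of strings
    list_of_words.extend(before.split())
    list_of_words.extend(after.split())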
Let's put this all together with your pandas code. pandas isn't my strongest area, so please don't assume that I haven't made some elementary mistake if you see something weird looking.
import re
from itertools import chain
from collections import Counter

def remove_words(sentence, bad_words):
    pat = r'(?:{})'.format(r'|'.join(bad_words))
    return re.sub(pat, '', sentence, flags=re.IGNORECASE)

bad_words = ['we', 'is', 'to']
sentence_list = df.Answer.apply(lambda x: remove_words(str(x), bad_words))

c = Counter()
data_pat = r'((?:\b\w+?\b\s*){0,3})data((?:\s*\b\w+?\b){0,3})'
for sentence in sentence_list:
    res = re.findall(data_pat, sentence, flags=re.IGNORECASE)
    words = chain.from_iterable(map(str.split, chain.from_iterable(map(chain, chain(res)))))
    c.update(words)
The nice thing about the regex we're using is that all of the complicated parts don't care about what word we're using. With a slight change, we can make a format string
base_pat = r'((?:\b\w+?\b\s*){{0,3}}){}((?:\s*\b\w+?\b){{0,3}})'
such that
base_pat.format('data') == data_pat
So, given some list key_words of words we want to collect information about:
import re
from itertools import chain
from collections import Counter

def remove_words(sentence, bad_words):
    pat = r'(?:{})'.format(r'|'.join(bad_words))
    return re.sub(pat, '', sentence, flags=re.IGNORECASE)

bad_words = ['we', 'is', 'to']
sentence_list = df.Answer.apply(lambda x: remove_words(str(x), bad_words))

key_words = ['data', 'analytics']
d = {}
base_pat = r'((?:\b\w+?\b\s*){{0,3}}){}((?:\s*\b\w+?\b){{0,3}})'
for keyword in key_words:
    key_pat = base_pat.format(keyword)
    c = Counter()
    for sentence in sentence_list:
        res = re.findall(key_pat, sentence, flags=re.IGNORECASE)
        words = chain.from_iterable(map(str.split, chain.from_iterable(map(chain, chain(res)))))
        c.update(words)
    d[keyword] = c
Now we have a dictionary d that maps keywords, like data and analytics to Counters that map words that are not on our blacklist to their counts in the vicinity of the associated keyword. Something like
d = {'data': Counter({'important': 2,
                      'very': 3}),
     'analytics': Counter({'boring': 5,
                           'sleep': 3})}
As to how we get the top 10 words, that's basically the thing Counter is best at.
key_words, _ = zip(*Counter(w for sentence in sentence_list for w in sentence.split()).most_common(10))
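Once d has been filled using those top-10 key_words, inspecting each group is again just Counter.most_common, e.g.:
for keyword, counter in d.items():
    print(keyword, counter.most_common(5))  # top 5 neighbouring words per keyword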
