I would like to transform this Series
from nltk import word_tokenize, pos_tag
from nltk.corpus import stopwords
stop_words = set(stopwords.words("english"))
df = pd.Series([["comic of book", "horror of movie"], ["dark", "dark french"]])
>> 0 [comic of book, horror of movie]
>> 1 [dark, dark french]
by removing stopwords and keeping only nouns (NN in nltk). I think the apply function is the best solution; however, applying it directly to these texts loses information. I get this
df.apply(lambda x: [wrd for ing in x for wrd in word_tokenize(ing) if wrd not in stop_words])
0 [comic, book, horror, movie]
1 [dark, dark, french]
instead of
0 [comic book, horror movie]
1 [dark, dark french]
I am missing something in the for loop, and it splits each bag of words into individual words (maybe apply is not the right tool here).
def rmsw(y):
    return ' '.join(set(y.split()) - stop_words)
pd.Series([[rmsw(y) for y in x] for x in df], df.index)
0 [comic book, horror movie]
1 [dark, dark french]
dtype: object
To maintain order and frequency:
def rmsw(y):
    return ' '.join([w for w in y.split() if w not in stop_words])
pd.Series([[rmsw(y) for y in x] for x in df], df.index)
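The question also asks to keep only nouns (NN), and pos_tag is already imported. One way to layer that on top of the stopword filter could look like the sketch below; keep_nouns is just an illustrative name, it assumes the NLTK 'punkt' and 'averaged_perceptron_tagger' data are available, and the exact tags you get back depend on the tagger:

def keep_nouns(y):
    # drop stopwords first, then keep only tokens tagged as nouns (NN, NNS, NNP, NNPS)
    tokens = [w for w in word_tokenize(y) if w not in stop_words]
    return ' '.join(w for w, tag in pos_tag(tokens) if tag.startswith('NN'))

pd.Series([[keep_nouns(y) for y in x] for x in df], df.index)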
If performance is more important than elegance, a classic algorithm can do the trick.
The following code will never win a beauty contest, but it's about 350-400% faster (on my ThinkPad) than the, admittedly, much nicer list comprehension approach. The gap will grow with the size of your data set, as it works on more primitive data types (lists) and only converts back to pandas at the end.
temp_list = list()
for serie in df:
    elements = list()
    for element in serie:
        # pad with spaces so stopwords at the start or end of the string are matched too
        element = f' {element} '
        for word in element.split():
            if word in stop_words:
                element = element.replace(f' {word} ', ' ')
        elements.append(element.strip())
    temp_list.append(elements)
df = pd.Series(temp_list)
print(df)
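If you want to reproduce the comparison yourself, here is a rough sketch with timeit (timings will vary by machine). It assumes the rmsw and stop_words defined earlier are in scope, the function names are only illustrative, and it rebuilds the original Series as raw because df was overwritten above:

import timeit

raw = pd.Series([["comic of book", "horror of movie"], ["dark", "dark french"]])

def via_comprehension(s):
    return pd.Series([[rmsw(y) for y in x] for x in s], s.index)

def via_plain_lists(s):
    temp = []
    for serie in s:
        elements = []
        for element in serie:
            padded = f' {element} '
            for word in padded.split():
                if word in stop_words:
                    padded = padded.replace(f' {word} ', ' ')
            elements.append(padded.strip())
        temp.append(elements)
    return pd.Series(temp, s.index)

print(timeit.timeit(lambda: via_comprehension(raw), number=10000))
print(timeit.timeit(lambda: via_plain_lists(raw), number=10000))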
The choice is up to you :)
I currently have a list of words about MMA.
I want to create a new column in my Pandas DataFrame called 'MMA Related Word Count'. For each row, I want to analyze the 'Speech' column and sum up how often words from the list below occur within the speech. Does anyone know the best way to do this? I'd love to hear it, thanks in advance!
Please take a look at my dataframe.
CODE EXAMPLE:
import pandas as pd
mma_related_words = ['mma', 'ju jitsu', 'boxing']
data = {
    "Name": ['Dana White', 'Triple H'],
    "Speech": ['mma is a fantastic sport. ju jitsu makes you better as a person.', 'Boxing sucks. Professional dancing is much better.']
}
#load data into a DataFrame object:
df = pd.DataFrame(data)
print(df)
CURRENT DATAFRAME:
         Name                                             Speech
0  Dana White  mma is a fantastic sport. ju jitsu makes you better as a person.
1    Triple H  Boxing sucks. Professional dancing is much better.
EXPECTED OUTPUT:
Exactly the same as above, but with a new column 'MMA Related Word Count' on the right: value 2 for Dana White and value 1 for Triple H.
You can use a regex with str.count:
import re
regex = '|'.join(map(re.escape, mma_related_words))
# 'mma|ju\\ jitsu|boxing'
df['Word Count'] = df['Speech'].str.count(regex, flags=re.I)
# or
# df['Word Count'] = df['Speech'].str.count(r'(?i)'+regex)
output:
Name Speech Word Count
0 Dana White mma is a fantastic sport. ju jitsu makes you b... 2
1 Triple H Boxing sucks. Professional dancing is much bet... 1
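One caveat to be aware of: str.count with this pattern also matches a keyword inside a longer word (e.g. 'mma' inside 'dilemma'). If that matters for your data, word boundaries can be added around the alternation, for example:

regex = r'\b(?:' + '|'.join(map(re.escape, mma_related_words)) + r')\b'
df['Word Count'] = df['Speech'].str.count(regex, flags=re.I)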
A simple loop inside an apply lambda function also works; try this:
def fun(string):
    cnt = 0
    for w in mma_related_words:
        if w.lower() in string.lower():
            cnt = cnt + 1
    return cnt

df['MMA Related Word Count'] = df['Speech'].apply(lambda x: fun(string=x))
The same can also be written as:
df['MMA Related Word Count1'] = df['Speech'].apply(lambda x: sum([1 for w in mma_related_words if w.lower() in str(x).lower()]))
The output of df then contains the new count columns, with a value of 2 for Dana White and 1 for Triple H.
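Note that this loop counts each keyword at most once per speech, whereas the regex answer counts every occurrence. If you want occurrence counts in the same loop style, something like the following should work (the column name is just illustrative):

df['MMA Related Word Count2'] = df['Speech'].apply(
    lambda x: sum(str(x).lower().count(w.lower()) for w in mma_related_words)
)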
I'm attempting to do an NLP project with a Goodreads data set. My data set is a list of dictionaries. Each dictionary looks like this (the list is called 'reviews'):
>>> reviews[0]
{'user_id': '8842281e1d1347389f2ab93d60773d4d',
'book_id': '23310161',
'review_id': 'f4b4b050f4be00e9283c92a814af2670',
'rating': 4,
'review_text': 'Fun sequel to the original.',
'date_added': 'Tue Nov 17 11:37:35 -0800 2015',
'date_updated': 'Tue Nov 17 11:38:05 -0800 2015',
'read_at': '',
'started_at': '',
'n_votes': 7,
'n_comments': 0}
There are 700k+ of these dictionaries in my dataset.
First question: I am only interested in the elements 'rating' and 'review_text'. I know I can delete elements from each dictionary, but how do I do it for all of the dictionaries?
Second question: I am able to do sentence and word tokenization of an individual dictionary in the list by specifying the dictionary in the list, then the element 'review_text' within the dictionary like so:
paragraph = reviews[0]['review_text']
And then applying sent_tokenize and word_tokenize like so:
print(sent_tokenize(paragraph))
print(word_tokenize(paragraph))
But how do I apply these methods to the entire data set? I am stuck here, and cannot even attempt to do any of the text preprocessing (lower casing, removing punctuation, lemmatizing, etc).
TIA
To answer the first question, you can simply put the records into a DataFrame with only the columns you are interested in (i.e. rating and review_text). This avoids looping over and managing the records one by one, and the data is also easier to manipulate in further processing steps.
Once you have the DataFrame, use apply to preprocess (e.g. lowercase, tokenize, remove punctuation, lemmatize, and stem) the text column and generate a new column named tokens that stores the preprocessed text (i.e. the tokens). This addresses the second question.
from nltk import sent_tokenize, word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer
import string
import pandas as pd

punc_list = list(string.punctuation)
porter = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def text_processing(row):
    all_words = list()
    # sentence tokenize
    for sent in sent_tokenize(row['review_text']):
        # lowercase and word tokenize
        words = word_tokenize(sent.lower())
        # lemmatize
        words_lem = [lemmatizer.lemmatize(w) for w in words]
        # remove punctuation
        used_words = [w for w in words_lem if w not in punc_list]
        # stem
        words_stem = [porter.stem(w) for w in used_words]
        all_words += words_stem
    return all_words

# create dataframe from list of dicts (select only the interesting columns)
df = pd.DataFrame(reviews, columns=['user_id', 'rating', 'review_text'])
df['tokens'] = df.apply(lambda x: text_processing(x), axis=1)
print(df.head())
example of output:
user_id rating review_text tokens
0 1 4 Fun sequel to the [fun, sequel, to, the]
1 2 2 It was a slippery slope [it, wa, a, slipperi, slope]
2 3 3 The trick to getting [the, trick, to, get]
3 4 3 The bird had a [the, bird, had, a]
4 5 5 That dog likes cats [that, dog, like, cat]
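One practical note: sent_tokenize, word_tokenize, and WordNetLemmatizer rely on NLTK data packages that may need to be downloaded once, e.g.:

import nltk

nltk.download('punkt')    # tokenizer models used by sent_tokenize / word_tokenize
nltk.download('wordnet')  # lexical database used by WordNetLemmatizer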
Finally, if you prefer not to work with a DataFrame, you can export it to other formats such as CSV (to_csv), JSON (to_json), or a list of dicts (to_dict('records')).
Hope this helps.
I have the following data frame called sentences
data = ["Home of the Jacksons"], ["Is it the real thing?"], ["What is it with you?"], [ "Tomatoes are the best"] [ "I think it's best to path ways now"]
sentences = pd.DataFrame(data, columns = ['sentence'])
And a dataframe called stopwords:
data = [["the"], ["it"], ["best"], [ "is"]]
stopwords = pd.DataFrame(data, columns = ['word'])
I want to remove all stopwords from sentences["sentence"]. I tried the code below but it does not work. I think there is an issue with my if statement. Can anyone help?
Def remove_stopwords(input_string, stopwords_list):
    stopwords_list = list(stopwords_list)
    my_string_split = input_string.split(' ')
    my_string = []
    for word in my_string_split:
        if word not in stopwords_list:
            my_string.append(word)
    my_string = " ".join(my_string)
    return my_string
sentence['cut_string']= sentence.apply(lambda row: remove_stopwords(row['sentence'], stopwords['word']), axis=1)
When I apply the function, it just returns the first or first few strings in the sentence but does not cut out stopwords at all. Kinda stuck here
You can convert the stopwords word column to a list and remove those words from sentences using a list comprehension:
stopword_list = stopwords['word'].tolist()
sentences['filtered'] = sentences['sentence'].apply(lambda x: ' '.join([i for i in x.split() if i not in stopword_list]))
You get
0 Home of Jacksons
1 Is real thing?
2 What with you?
3 Tomatoes are
4 I think it's to path ways now
Or you can wrap the code in a function:
def remove_stopwords(input_string, stopwords_list):
    my_string = []
    for word in input_string.split():
        if word not in stopwords_list:
            my_string.append(word)
    return " ".join(my_string)

stopword_list = stopwords['word'].tolist()
sentences['sentence'].apply(lambda row: remove_stopwords(row, stopword_list))
There are several syntax errors in your code above. If you keep the stopwords as a list (or set) rather than a DataFrame, the following will work:
data = ["Home of the Jacksons", "Is it the real thing?", "What is it with you?", "Tomatoes are the best", "I think it's best to path ways now"]
sentences = pd.DataFrame(data, columns = ['sentence'])
stopwords = ["the", "it", "best", "is"]
sentences.sentence.str.split().apply(lambda x: " ".join([y for y in x if y not in stopwords]))
The key to success is to convert the list of stopwords into a set(): sets have O(1) lookup times, while lookups in a list take O(N).
stop_set = set(stopwords.word.tolist())
sentences.sentence.str.split()\
.apply(lambda x: ' '.join(w for w in x if w not in stop_set))
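Assigning the result back to the cut_string column the question was after would then look like this, for example:

sentences['cut_string'] = sentences.sentence.str.split()\
    .apply(lambda x: ' '.join(w for w in x if w not in stop_set))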
Problem:
Using scikit-learn to find the number of hits of variable n-grams of a particular vocabulary.
Explanation:
I got the examples from here.
Imagine I have a corpus and I want to find how many hits (counting) has a vocabulary like the following one:
myvocabulary = [(window=4, words=['tin', 'tan']),
                (window=3, words=['electrical', 'car']),
                (window=3, words=['elephant', 'banana'])]
What I call window here is the length of the span of words within which the words can appear, as follows:
'tin tan' is a hit (within 4 words)
'tin dog tan' is a hit (within 4 words)
'tin dog cat tan' is a hit (within 4 words)
'tin car sun eclipse tan' is NOT a hit; tin and tan appear more than 4 words away from each other.
I just want to count how many times (window=4, words=['tin', 'tan']) appears in a text, and the same for every other entry, and then add the result to a pandas DataFrame in order to compute tf-idf.
I could only find something like this:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(vocabulary = myvocabulary, stop_words = 'english')
tfs = tfidf.fit_transform(corpus.values())
where vocabulary is a simple list of strings, either single words or several words.
Besides, from the scikit-learn docs:
class sklearn.feature_extraction.text.CountVectorizer
ngram_range : tuple (min_n, max_n)
The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used.
does not help either.
Any ideas?
I am not sure if this can be done using CountVectorizer or TfidfVectorizer. I have written my own function for doing this as follows:
import pandas as pd
import numpy as np
import string

def contained_within_window(token, word1, word2, threshold):
    word1 = word1.lower()
    word2 = word2.lower()
    # strip punctuation and lowercase the text
    token = token.translate(str.maketrans('', '', string.punctuation)).lower()
    if (word1 in token) and (word2 in token):
        word_list = token.split(" ")
        word1_index = [i for i, x in enumerate(word_list) if x == word1]
        word2_index = [i for i, x in enumerate(word_list) if x == word2]
        count = 0
        for i in word1_index:
            for j in word2_index:
                if np.abs(i - j) <= threshold:
                    count = count + 1
        return count
    return 0
SAMPLE:
corpus = [
    'This is the first document. And this is what I want',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
    'I like coding in sklearn',
    'This is a very good question'
]
df = pd.DataFrame(corpus, columns=["Test"])
your df will look like this:
Test
0 This is the first document. And this is what I...
1 This document is the second document.
2 And this is the third one.
3 Is this the first document?
4 I like coding in sklearn
5 This is a very good question
Now you can apply contained_within_window as follows:
sum(df.Test.apply(lambda x: contained_within_window(x,word1="this", word2="document",threshold=2)))
And you get:
2
You can just run a for loop to check the different instances.
And you can use this to construct your pandas df and apply tf-idf on it, which is straightforward.
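For example, to run it over every (window, word pair) entry from the question, a loop like the following could be used; the plain-tuple vocabulary structure here is an assumption, rewritten from the question's pseudocode:

# hypothetical restructuring of myvocabulary as (word1, word2, window) tuples
myvocabulary = [('tin', 'tan', 4),
                ('electrical', 'car', 3),
                ('elephant', 'banana', 3)]

counts = {}
for word1, word2, window in myvocabulary:
    # total hits of this word pair (within its window) across the corpus
    counts[(word1, word2)] = df.Test.apply(
        lambda x: contained_within_window(x, word1=word1, word2=word2, threshold=window)
    ).sum()

print(counts)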
I would like to create a new pandas column by running a word stemming function over a list of words in another column. I can tokenize a single string by using apply and lambda, but I cannot figure out how to extrapolate this to the case of running it over a list of words.
test = {'Statement' : ['congratulations on the future','call the mechanic','more text'], 'Other' : [2,3,4]}
df = pd.DataFrame(test)
df['tokenized'] = df.apply (lambda row: nltk.word_tokenize(row['Statement']), axis=1)
I know I could solve it with a nested for loop, but that seems inefficient and results in a SettingWithCopyWarning:
df['stems'] = ''
for x in range(len(df)):
print(len(df['tokenized'][x]))
df['stems'][x] = row_stems=[]
for y in range(len(df['tokenized'][x])):
print(df['tokenized'][x][y])
row_stems.append(stemmer.stem(df['tokenized'][x][y]))
Isn't there a better way to do this?
EDIT:
Here's an example of what the result should look like:
Other Statement tokenized stems
0 2 congratulations on the future [congratulations, on, the, future] [congratul, on, the, futur]
1 3 call the mechanic [call, the, mechanic] [call, the, mechan]
2 4 more text [more, text] [more, text]
No need to run a loop, indeed. At least not an explicit loop. A list comprehension will work just fine.
Assuming you use a Porter stemmer instance ps:
df['stems'] = df['tokenized'].apply(lambda words: [ps.stem(word) for word in words])
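For completeness, a minimal way to set up the ps instance assumed above is NLTK's Porter stemmer:

from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()
ps.stem('congratulations')  # -> 'congratul', matching the expected output above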