Removing stopwords from a pandas series based on list - python

I have the following data frame called sentences:
data = [["Home of the Jacksons"], ["Is it the real thing?"], ["What is it with you?"], ["Tomatoes are the best"], ["I think it's best to path ways now"]]
sentences = pd.DataFrame(data, columns = ['sentence'])
And a dataframe called stopwords:
data = [["the"], ["it"], ["best"], [ "is"]]
stopwords = pd.DataFrame(data, columns = ['word'])
I want to remove all stopwords from sentences["sentence"]. I tried the code below but it does not work. I think there is an issue with my if statement. Can anyone help?
Def remove_stopwords(input_string, stopwords_list):
    stopwords_list = list(stopwords_list)
    my_string_split = input_string.split(' ')
    my_string = []
    for word in my_string_split:
        if word not in stopwords_list:
            my_string.append(word)
    my_string = " ".join(my_string)
    return my_string
sentence['cut_string']= sentence.apply(lambda row: remove_stopwords(row['sentence'], stopwords['word']), axis=1)
When I apply the function, it just returns the first word or first few words of each sentence and does not cut out the stopwords at all. I'm kinda stuck here.

You can convert the stopwords word column to a list and remove those words from sentences using a list comprehension:
stopword_list = stopwords['word'].tolist()
sentences['filtered'] = sentences['sentence'].apply(lambda x: ' '.join([i for i in x.split() if i not in stopword_list]))
You get
0 Home of Jacksons
1 Is real thing?
2 What with you?
3 Tomatoes are
4 I think it's to path ways now
Or you can wrap the code in a function:
def remove_stopwords(input_string, stopwords_list):
    my_string = []
    for word in input_string.split():
        if word not in stopwords_list:
            my_string.append(word)
    return " ".join(my_string)

stopword_list = stopwords['word'].tolist()
sentences['sentence'].apply(lambda row: remove_stopwords(row, stopword_list))

You have many syntax errors in your code above. If you keep the stopwords as a list (or set) rather than a DataFrame, the following will work:
data = ["Home of the Jacksons", "Is it the real thing?", "What is it with you?", "Tomatoes are the best", "I think it's best to path ways now"]
sentences = pd.DataFrame(data, columns = ['sentence'])
stopwords = ["the", "it", "best", "is"]
sentences.sentence.str.split().apply(lambda x: " ".join([y for y in x if y not in stopwords]))

The key to success is to convert the list of stopwords into a set(): sets have O(1) lookup times, while lookups in a list take O(N).
stop_set = set(stopwords.word.tolist())
sentences.sentence.str.split()\
    .apply(lambda x: ' '.join(w for w in x if w not in stop_set))

Related

Applying Snowballstemmer to a Pandas dataframe for each word

So I want to apply stemming using Snowballstemmer on a column (unstemmed) of a dataframe in order to use a classification algorithm.
So my code looks like the following:
df = pd.read_excel(...)
df["content"] = df['column2'].str.lower()
stopword_list = nltk.corpus.stopwords.words('dutch')
df['unstemmed'] = df['content'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stopword_list)]))
df["unstemmed"] = df["unstemmed"].str.replace(r"[^a-zA-Z ]+", " ").str.strip()
df["unstemmed"] = df["unstemmed"].replace('\s+', ' ', regex=True)
df['unstemmed'] = df['unstemmed'].str.split()
df['stemmed'] = df['unstemmed'].apply(lambda x : [stemmer.stem(y) for y in x])
So first, I convert all upper cases to lower cases and remove all Dutch stopwords. This is followed by removing all special characters and then splitting all words. I checked and all columns are "objects".
I get the following error: stem() missing 1 required positional argument: 'token'
How can I solve this?
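The question does not show where stemmer is defined, but this particular error usually means stem() is being called on the SnowballStemmer class itself rather than on an instance, so the word gets bound to self and token is left missing. A minimal sketch of the likely fix, assuming NLTK's Snowball stemmer is the one intended:
from nltk.stem.snowball import SnowballStemmer

# Instantiate the stemmer for Dutch; calling SnowballStemmer.stem(word) on the
# class itself raises "stem() missing 1 required positional argument: 'token'".
stemmer = SnowballStemmer('dutch')

df['stemmed'] = df['unstemmed'].apply(lambda words: [stemmer.stem(w) for w in words])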

why does if not in(x,y) not work at all in python

I want to keep a word from each row of my column only if it is not in the stopwords and not in string.punctuation.
This is my data after tokenizing and removing the stopwords; I also want to remove the punctuation at the same time as the stopwords. You can see in row two that there is still a comma after usf. I thought of if word not in (stopwords, string.punctuation), since it should mean "not in stopwords and not in string.punctuation" (an approach I saw suggested elsewhere), but it fails to remove both the stopwords and the punctuation. How do I fix this?
data['text'].head(5)
Out[38]:
0 ['ve, searching, right, words, thank, breather...
1 [free, entry, 2, wkly, comp, win, fa, cup, fin...
2 [nah, n't, think, goes, usf, ,, lives, around,...
3 [even, brother, like, speak, ., treat, like, a...
4 [date, sunday, !, !]
Name: text, dtype: object
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string

data = pd.read_csv(r"D:/python projects/read_files/SMSSpamCollection.tsv",
                   sep='\t', header=None)
data.columns = ['label', 'text']
stopwords = set(stopwords.words('english'))

def process(df):
    data = word_tokenize(df.lower())
    data = [word for word in data if word not in (stopwords, string.punctuation)]
    return data

data['text'] = data['text'].apply(process)
If you still want to do it in one if statement, you could convert string.punctuation to a set and combine it with stopwords with an OR operation. This is how it would look:
data = [word for word in data if word not in (stopwords|set(string.punctuation))]
word not in (stopwords, string.punctuation) tests membership in a two-element tuple, so it only compares the word against the stopwords set itself and the whole punctuation string; no word is ever equal to either, so nothing gets removed. You need to change
data = [word for word in data if word not in (stopwords,string.punctuation)]
to
data = [word for word in data if word not in stopwords and word not in string.punctuation]
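A small illustration of the difference, using a few made-up tokens (the values here are just for demonstration):
import string

stopwords = {'the', 'it', 'is'}      # sample stopword set
tokens = ['nah', 'the', ',', 'usf']  # sample tokens

# Membership in a 2-tuple only compares each token against the tuple's two
# elements (the whole set and the whole punctuation string), so nothing is removed.
print([w for w in tokens if w not in (stopwords, string.punctuation)])
# ['nah', 'the', ',', 'usf']

# Checking each container separately (or their union as a set) filters correctly.
print([w for w in tokens if w not in stopwords and w not in string.punctuation])
# ['nah', 'usf']
print([w for w in tokens if w not in (stopwords | set(string.punctuation))])
# ['nah', 'usf']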
Alternatively, in the function process you can convert the punctuation string to a pandas.core.series.Series and combine it with the stopwords using concat. The function would be:
def process(df):
    data = word_tokenize(df.lower())
    # build one Series containing both the stopwords and the punctuation characters
    combined = pd.concat([pd.Series(list(stopwords)), pd.Series(list(string.punctuation))])
    data = [word for word in data if word not in combined.tolist()]
    return data

How to remove several different stems at the end of the word using slicing

I understand there are tools such as NLTK that will do this for me; however, I would like to understand how I can slice several stems off the words in a list efficiently.
Say my list of words is:
list = ["another", "cats", "walrus", "relaxed", "annoyingly", "rest", "normal", "hopping", "classes", "wing", "feed"]
And the common stems I would like to remove are:
stems = ["s", "es", "ed", "est", "ing", "ly"] etc
With words I do not want stemmed specified as:
noStem = ["walrus", "rest", "wing", "feed"]
I have worked out how to do it for one specific stem like "s". For example, my code would be:
for eachWord in list:
    if eachWord not in noStem:
        if eachWord[-1] == "s":
            eachWord = eachWord[:-1]
    stemmedList = stemmedList + [eachWord]
I am not sure how I would apply this to all my stems in a more efficient way.
Thanks for your help and advice!
I would suggest you convert noStem to a set so that the check if eachWord not in noStem is fast. Then you can check if the word endswith any stem in stems. If it does, you can use the largest stem that matches and remove it from the word:
lst = ["another", "cats", "walrus", "relaxed", "annoyingly", "rest", "normal", "hopping", "classes", "wing", "feed"]
stems = ["s", "es", "ed", "est", "ing", "ly"]
noStem = {"walrus", "rest", "wing", "feed"}
stemmedList = []
for word in lst:
    if word in noStem or not any([word.endswith(stem) for stem in stems]):
        stemmedList.append(word)
    else:
        stem = max([s for s in stems if word.endswith(s)], key=len)
        stemmedList.append(word[:len(word) - len(stem)])
print(stemmedList)
# ['another', 'cat', 'walrus', 'relax', 'annoying', 'rest', 'normal', 'hopp', 'class', 'wing', 'feed']
It's much more complicated than this, but here is some starter code using the much faster pandas module.
import pandas as pd
import re
word_list = ["another", "cats", "walrus", "relaxed", "annoyingly", "rest", "normal", "hopping", "classes", "wing", "feed"]
stems = ["es", "ed", "est", "ing", "ly", "s"]
# a set for quick lookup
noStem = set(["walrus", "rest", "wing", "feed"])
# build series
words = pd.Series(word_list)
# filter out words in noStem
words = words[words.apply(lambda x: x not in noStem)]
# compile the regular expression once for performance - join all stems into one alternation
term_matching = '|'.join(stems)
expr = re.compile(r'(.+?)({})$'.format(term_matching))
df = words.str.extract(expr, expand=True)
df.dropna(how='any', inplace=True)
df.columns = ['words', 'stems']
stemmed_list = df.words.tolist()
I hope it helps ...
I think that isn't a bad start. You just need to add a second loop to be able to work with multiple endings. You can try something like the code below. (You'll notice I've renamed the variable list, because it's dangerous to have variables shadowing built-in names.)
stemmed_list = []
for word in word_list:
    if word not in noStem:
        for ending in stems:
            if word.endswith(ending):
                word = word[:-len(ending)]
                break  # This will prevent iterating over all endings once a match is found
    stemmed_list.append(word)
Or, if as per your comment you don't want to use endswith:
stemmed_list = []
for word in word_list:
    if word not in noStem:
        for ending in stems:
            if word[-len(ending):] == ending:
                word = word[:-len(ending)]
                break  # This will prevent iterating over all endings once a match is found
    stemmed_list.append(word)

Flattening 3D list of words to 2D

I have a pandas column with text strings. For simplicity, let's assume I have a column with two strings.
s = ["How are you. Don't wait for me", "this is all fine"]
I want to get something like this:
[["How", "are", "you"], ["Don't", "wait", "for", "me"], ["this", "is", "all", "fine"]]
Basically, take each sentence of a document and tokenize it into a list of words, so that finally I have a list of lists of strings.
I tried using a map like below:
nlp=spacy.load('en')
def text_to_words(x):
    """This function converts sentences in a text to a list of words."""
    global log_txt
    x = re.sub(r"\s\s+", " ", x.strip())
    txt_to_words = [str(doc).replace(".", "").split(" ") for doc in nlp(x).sents]
    #log_txt=log_txt.extend(txt_to_words)
    return txt_to_words
The nlp from spacy is used to split a string of text into a list of sentences.
log_txt = list(map(text_to_words, s))
log_txt
But this, as you know, puts the results from both documents into another enclosing list:
[[['How', 'are', 'you'], ["Don't", 'wait', 'for', 'me']],
[['this', 'is', 'all', 'fine']]]
You'll need a nested list comprehension. Additionally, you can get rid of punctuation using re.sub.
import re
data = ["How are you. Don't wait for me", "this is all fine"]
words = [
    re.sub(r'[^a-z\s]', '', str(j).lower()).split() for i in data for j in nlp(i).sents
]
Or,
words = []
for i in data:
    ...  # do something here
    for j in nlp(i).sents:
        words.append(re.sub(r'[^a-z\s]', '', str(j).lower()).split())
There is a much simpler way using a list comprehension.
You can first join the strings with a period '.' and split them again.
[x.split() for x in '.'.join(s).split('.')]
It will give the desired result.
[["How", "are","you"],["Don't", "wait", "for", "me"],["this","is","all","fine"]]
For pandas DataFrames, you get an object back, and hence a list of lists from the tolist function; just extract the first element.
For example,
import pandas as pd
def splitwords(s):
    s1 = [x.split() for x in '.'.join(s).split('.')]
    return s1
df = pd.DataFrame(s)
result = df.apply(splitwords).tolist()[0]
Again, it will give you the preferred result.
Hope it helps ;)

python word grouping based on words before and after

I am trying to create groups of words. First I count all the words. Then I establish the top 10 words by word count. Then I want to create 10 groups of words based on those top 10. Each group consists of all the words that appear before and after the top word.
I have survey results stored in a python pandas dataframe structured like this
Question_ID | Customer_ID | Answer
1 234 Data is very important to use because ...
2 234 We value data since we need it ...
I also saved the answers column as a string.
I am using the following code to find 3 words before and after a word (I actually had to create a string out of the answers column):
answers_str = df.Answer.apply(str)
for value in answers_str:
    non_data = re.split('data|Data', value)
    terms_list = [term for term in non_data if len(term) > 0]  # skip empty terms
    substrs = [term.split()[0:3] for term in terms_list]  # slice and grab first three terms
    result = [' '.join(term) for term in substrs]  # combine the terms back into substrings
    print result
I have been manually creating groups of words - but is there a way of doing it in python?
So based on the example shown above the group with word counts would look like this:
group "data":
data: 2
important: 1
value: 1
need: 1
then when it goes through the whole file, there would be another group:
group "analytics:
analyze: 5
report: 7
list: 10
visualize: 16
The idea would be to get rid of "we", "to", "is" as well - but I can do it manually if that's not possible.
Then establish the 10 most used words (by word count) and create 10 groups with the words that appear in front of and behind those top 10 words.
We can use regex for this. We'll be using this regular expression
((?:\b\w+?\b\s*){0,3})[dD]ata((?:\s*\b\w+?\b){0,3})
which you can test for yourself, to extract the three words before and after each occurrence of data.
First, let's remove all the words we don't like from the strings.
import re

# If you're processing a lot of sentences, it's probably wise to preprocess
# the pattern, assuming that bad_words is the same for all sentences
def remove_words(sentence, bad_words):
    pat = r'(?:{})'.format(r'|'.join(bad_words))
    return re.sub(pat, '', sentence, flags=re.IGNORECASE)
Then we want to get the words that surround data in each line:
data_pat = r'((?:\b\w+?\b\s*){0,3})[dD]ata((?:\s*\b\w+?\b){0,3})'
res = re.findall(data_pat, s, flags=re.IGNORECASE)
gives us a list of tuples of strings. We want to get a list of those strings after they are split.
from itertools import chain
list_of_words = list(chain.from_iterable(map(str.split, chain.from_iterable(map(chain, chain(res))))))
That's not pretty, but it works. Basically, we pull the tuples out of the list, pull the strings out of each tuple, split each string, and then pull all of the resulting words into one big list.
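As a quick illustration of what those two steps produce, here is the pattern applied to the second sample answer from the question (skipping the bad-word removal step, purely to show the mechanics):
import re
from itertools import chain

data_pat = r'((?:\b\w+?\b\s*){0,3})[dD]ata((?:\s*\b\w+?\b){0,3})'
s = "We value data since we need it"

# findall returns one tuple per occurrence of "data": (words before, words after)
res = re.findall(data_pat, s, flags=re.IGNORECASE)
print(res)
# [('We value ', ' since we need')]

# flatten the tuples of strings into one list of individual words
list_of_words = list(chain.from_iterable(
    map(str.split, chain.from_iterable(map(chain, chain(res))))))
print(list_of_words)
# ['We', 'value', 'since', 'we', 'need']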
Let's put this all together with your pandas code. pandas isn't my strongest area, so please don't assume that I haven't made some elementary mistake if you see something weird looking.
import re
from itertools import chain
from collections import Counter

def remove_words(sentence, bad_words):
    pat = r'(?:{})'.format(r'|'.join(bad_words))
    return re.sub(pat, '', sentence, flags=re.IGNORECASE)

bad_words = ['we', 'is', 'to']
sentence_list = df.Answer.apply(lambda x: remove_words(str(x), bad_words))

c = Counter()
data_pat = r'((?:\b\w+?\b\s*){0,3})data((?:\s*\b\w+?\b){0,3})'
for sentence in sentence_list:
    res = re.findall(data_pat, sentence, flags=re.IGNORECASE)
    words = chain.from_iterable(map(str.split, chain.from_iterable(map(chain, chain(res)))))
    c.update(words)
The nice thing about the regex we're using is that none of the complicated parts care about which keyword we're matching. With a slight change, we can make a format string
base_pat = r'((?:\b\w+?\b\s*){{0,3}}){}((?:\s*\b\w+?\b){{0,3}})'
such that
base_pat.format('data') == data_pat
So, given some list key_words of words we want to collect information about:
import re
from itertools import chain
from collections import Counter

def remove_words(sentence, bad_words):
    pat = r'(?:{})'.format(r'|'.join(bad_words))
    return re.sub(pat, '', sentence, flags=re.IGNORECASE)

bad_words = ['we', 'is', 'to']
sentence_list = df.Answer.apply(lambda x: remove_words(str(x), bad_words))

key_words = ['data', 'analytics']
d = {}
base_pat = r'((?:\b\w+?\b\s*){{0,3}}){}((?:\s*\b\w+?\b){{0,3}})'
for keyword in key_words:
    key_pat = base_pat.format(keyword)
    c = Counter()
    for sentence in sentence_list:
        res = re.findall(key_pat, sentence, flags=re.IGNORECASE)
        words = chain.from_iterable(map(str.split, chain.from_iterable(map(chain, chain(res)))))
        c.update(words)
    d[keyword] = c
Now we have a dictionary d that maps keywords, like data and analytics, to Counters that map words that are not on our blacklist to their counts in the vicinity of the associated keyword. Something like
d = {'data': Counter({'important': 2,
                      'very': 3}),
     'analytics': Counter({'boring': 5,
                           'sleep': 3})}
As to how we get the top 10 words, that's basically the thing Counter is best at.
key_words, _ = zip(*Counter(w for sentence in sentence_list for w in sentence.split()).most_common(10))
