Stemming a pandas dataframe - python

I have a tweet dataset (taken from NLTK) which is currently in a pandas dataframe, but I need to stem it. I have tried many different ways and get various errors, such as
AttributeError: 'Series' object has no attribute 'lower'
and
KeyError: 'text'
I don't understand the KeyError, as the column is definitely called 'text'. However, I understand that I need to convert the dataframe to a string for the stemmer to work (I think).
Here is an example of the data and my code:
from nltk.stem.snowball import SnowballStemmer
from nltk.corpus import twitter_samples
import pandas as pd

stemmer = SnowballStemmer("english")
negative_tweets = twitter_samples.strings('negative_tweets.json')
negtweetsdf = pd.DataFrame(negative_tweets, columns=['text'])
print(stemmer.stem(negtweetsdf))

You need to apply the stemming function to the text column (a series), as follows:
negtweetsdf['text'].apply(stemmer.stem)
This will create a new series.
Functions that expect a single string value or similar will not simply work on a pandas dataframe or series. They need to be applied to each element of the series, which is why .apply is used.
Here is a worked example that tokenizes the tweets into lists of words inside a dataframe column.
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import TweetTokenizer
import pandas as pd

stemmer = SnowballStemmer("english")
tokenizer = TweetTokenizer()

df = pd.DataFrame([['some extremely exciting tweet'], ['another']], columns=['tweets'])
# split each tweet string into a list of tokens
df['tweets'] = df['tweets'].apply(tokenizer.tokenize)
# for each row, apply the stemmer to every token and return a list of stems
df['tweets'].apply(lambda x: [stemmer.stem(y) for y in x])
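If you want the stemmed tweets back as single strings rather than lists, one possible follow-up (a sketch using the df from the example above) is to join the stems and assign the result to a new column:
df['stemmed'] = df['tweets'].apply(lambda x: ' '.join(stemmer.stem(y) for y in x))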


TypeError: sequence item 0: expected str instance, tuple found

I have a column of tuples from which I would like to remove the brackets.
Example
words
(hello,me)
(what,can)
(ring, dog)
I have tried this:
df['words'].agg(','.join)
Unfortunately I receive the error in the title.
I would like this output:
words
hello,me
what,can
ring, dog
Any solution?
Also, strangely enough, with a different dataset that line of code works. Any ideas why?
I think you can use df.apply to update the words column, applying a function that modifies the value of each row:
import pandas as pd
df = pd.DataFrame({'words': [('hello','me'), ('what','can')]})
df['words'] = df.apply(lambda row: ','.join(row['words']), axis=1)
Edit: come to think of it, your original approach using df['words'].agg should also work, but you need to assign the result back to the words column for it to change the dataframe:
import pandas as pd
df = pd.DataFrame({'words': [('hello','me'), ('what','can')]})
df['words'] = df['words'].agg(','.join)
print(df)
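Either way, printing the frame on the two-row sample gives the tuples joined into plain strings:
      words
0  hello,me
1  what,can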

How to stem a pandas dataframe using nltk? The output should be a stemmed dataframe

I'm trying to pre-process a dataset. The dataset contains text data. I have created a pandas DataFrame from that dataset.
My question is: how can I apply stemming to the DataFrame and get a stemmed DataFrame as output?
Given a pandas df, you can stem the contents by applying a stemming function to the whole df after tokenizing the words.
As an example, I used the Snowball stemmer from nltk.
from nltk.stem.snowball import SnowballStemmer
englishStemmer = SnowballStemmer("english")  # create the stemmer
And this tokenizer:
from nltk.tokenize import WhitespaceTokenizer as w_tokenizer
Define your function:
def stemm_texts(text):
    # note: w_tokenizer is a class, so it has to be instantiated before use
    return [englishStemmer.stem(w) for w in w_tokenizer().tokenize(str(text))]
Apply the function on your df:
df = df.apply(lambda y: y.map(stemm_texts, na_action='ignore'))
Note that I additionally added the NaN ignore part.
You might want to detokenize again:
from nltk.tokenize.treebank import TreebankWordDetokenizer
detokenizer = TreebankWordDetokenizer()
df = df.apply(lambda y: y.map(detokenizer.detokenize, na_action='ignore'))
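Putting the pieces together on a tiny frame (a sketch; the column name text and the sample strings are just assumptions for illustration):
import pandas as pd
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import WhitespaceTokenizer

englishStemmer = SnowballStemmer("english")
tokenizer = WhitespaceTokenizer()

def stemm_texts(text):
    return [englishStemmer.stem(w) for w in tokenizer.tokenize(str(text))]

df = pd.DataFrame({'text': ['Running runners ran', None, 'stemming texts']})
df = df.apply(lambda y: y.map(stemm_texts, na_action='ignore'))
print(df)
# the text column now holds lists of stems, e.g. ['run', 'runner', 'ran'];
# the NaN row is left untouched because of na_action='ignore'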

How to access data in a re.finditer output object?

I'd like to access the 'span' and 'match' data from the object I've generated with re.finditer, but I can't find out how to transfer the object's contents into a pandas df so I can manipulate them more easily.
I can iterate through the object to print the data, but the re.finditer documentation does not say how to access it. The best I can find is the page https://docs.python.org/2.0/lib/match-objects.html
I tried just appending the rows to a pandas df, but no luck. See the code below; it gives this error:
TypeError: cannot concatenate object of type ""; only pd.Series, pd.DataFrame, and pd.Panel (deprecated) objs are valid
import re
import pandas as pd

def find_rez(string):
    regex = re.compile(r'\s\d{10}\s')
    return regex.finditer(string)

# open the file with the text data
file = open('prepaid_transactions_test2.txt')
text = file.read()

# get an iterator of match objects for all matches
rez_mo = find_rez(text)

# create an empty df with span and match columns
df = pd.DataFrame(columns=['span','match'])

# append each row from the iterator to the pandas df. NOT WORKING.
for i in rez_mo:
    df.append(i)
I'd like to have a pandas df with the range & match as columns. But I'm failing at converting the types it seems.
I just found a solution. It may not be the most elegant, but... it works.
for i in rez_mo:
    df.loc[len(df)] = [i.start()], [i.group()]
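A tidier pattern (a sketch that reuses the find_rez helper above; note that rez_mo is already exhausted by the first loop, so the matches are regenerated) is to build the rows first and construct the frame in one step:
rows = [{'span': m.span(), 'match': m.group()} for m in find_rez(text)]
df = pd.DataFrame(rows, columns=['span', 'match'])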

fuzzywuzzy to normalize strings in a pandas column

I have a dataframe like this
Now I want to normalize the strings in the 'comments' column to the word 'election'. I tried using fuzzywuzzy but wasn't able to implement it on a pandas dataframe to partially match the word 'election'. The output dataframe should have the word 'election' in the 'comments' column, like this.
Assume that I have around 100k rows, and that there can be many possible variants of the word 'election'.
Kindly guide me on this part.
With the answer you gave, you can use the pandas apply, stack and groupby functions to accelerate your code. You have input such as:
import pandas as pd
from fuzzywuzzy import fuzz

df = pd.DataFrame({'Merchant details': ['Alpha co', 'Bravo co'],
                   'Comments': ['electionsss are around',
                                'vote in eelecttions']})
For the 'Comments' column, you can create a temporary multi-index DF containing one word per row, by splitting and using the stack function:
df_temp = pd.DataFrame(
    {'split_comments': df['Comments'].str.split(' ', expand=True).stack()})
Then you create the column with the corrected words (following your idea), using apply and a fuzz.ratio comparison:
df_temp['corrected_comments'] = df_temp['split_comments'].apply(
    lambda wd: 'election' if fuzz.ratio(wd, 'election') > 75 else wd)
Finally, you write the corrected data back to the Comments column of df using the groupby and join functions:
df['Comments'] = df_temp.reset_index().groupby('level_0').apply(
    lambda wd: ' '.join(wd['corrected_comments']))
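With the sample input above, both misspelled tokens clear the 75 ratio threshold, so printing df should give:
print(df)
#   Merchant details              Comments
# 0         Alpha co   election are around
# 1         Bravo co      vote in election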
Don't operate on the dataframe directly; the overhead will kill you. Turn the column into a list, iterate over that, and finally assign the list back to the column.
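Here is a minimal sketch of that list-based approach (assuming the same df and fuzz import as in the answer above, and 'election' as the single target word):
comments = df['Comments'].tolist()
fixed = []
for text in comments:
    words = ['election' if fuzz.ratio(w, 'election') > 75 else w
             for w in text.split()]
    fixed.append(' '.join(words))
df['Comments'] = fixed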
OK, I tried this myself and came up with this code:
# word is the list of target words to normalize to, e.g. ['election']
for i in range(len(df)):
    a = df.comments[i].split()
    for j in word:
        for k in range(len(a)):
            if fuzz.ratio(j, a[k]) > 75:
                a[k] = j
    df.comments[i] = ' '.join(a)
But this approach seems slow for a large dataframe.
Can someone provide a better, more pythonic way of implementing this?

dropping rows containing non-English words in a pandas dataframe

I turned this Twitter corpus into a pandas data frame and was trying to find the non-English tweets and delete them from the data frame, so I did this:
for j in range(0,150):
    if not wordnet.synsets(df.i[j]):  # check whether the word is non-English
        df.drop(j)
print(df.shape)
But when I check the shape, no rows were dropped.
Am I using the drop function wrong, or do I need to keep track of the row index?
That's because df.drop() returns a copy instead of modifying your original dataframe. Try setting inplace=True:
for j in range(0,150):
    if not wordnet.synsets(df.i[j]):  # check whether the word is non-English
        df.drop(j, inplace=True)
print(df.shape)
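An alternative that avoids dropping rows one at a time is to build a boolean mask and filter in one step; this is a sketch that assumes the words live in a column named i, as in the question's df.i:
mask = df['i'].apply(lambda w: bool(wordnet.synsets(str(w))))
df = df[mask]
print(df.shape)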
This will filter out all the non-English rows in your pandas dataframe.
import nltk
nltk.download('words')
from nltk.corpus import words
import pandas as pd
data1 = pd.read_csv("testdata.csv")
Word = list(set(words.words()))
df_final = data1[data1['column_name'].str.contains('|'.join(Word))]
print(df_final)
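One caveat with this approach: str.contains with a plain '|'.join also matches substrings (e.g. 'cat' inside 'scatter'), and the pattern built from the full word list is very large and slow. A possible refinement, sketched here with the same variables, is to escape the words and anchor the pattern on word boundaries:
import re
pattern = r'\b(?:' + '|'.join(map(re.escape, Word)) + r')\b'
df_final = data1[data1['column_name'].str.contains(pattern)]
print(df_final)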
