I have a dataframe that consists of two columns: ID and TEXT. Pretend data is below:
ID TEXT
265 The farmer plants grain. The fisher catches tuna.
456 The sky is blue.
434 The sun is bright.
921 I own a phone. I own a book.
I know that NLTK functions do not work directly on dataframes. How could sent_tokenize be applied to the above dataframe?
When I try:
df.TEXT.apply(nltk.sent_tokenize)
The output is unchanged from the original dataframe. My desired output is:
TEXT
The farmer plants grain.
The fisher catches tuna.
The sky is blue.
The sun is bright.
I own a phone.
I own a book.
In addition, I would like to tie back this new (desired) dataframe to the original ID numbers like this (following further text cleansing):
ID TEXT
265 'farmer', 'plants', 'grain'
265 'fisher', 'catches', 'tuna'
456 'sky', 'blue'
434 'sun', 'bright'
921 'I', 'own', 'phone'
921 'I', 'own', 'book'
This question is related to another of my questions here. Please let me know if I can provide anything to help clarify my question!
Edit: as a result of warranted prodding by @alexis, here is a better response.
Sentence Tokenization
This should get you a DataFrame with one row for each ID & sentence:
import pandas

sentences = []
for row in df.itertuples():
    for sentence in row[2].split('.'):
        if sentence != '':
            sentences.append((row[1], sentence))
new_df = pandas.DataFrame(sentences, columns=['ID', 'SENTENCE'])
Whose output looks like this (note that the periods are dropped and a leading space remains on each sentence after the first in a row):
    ID                  SENTENCE
0  265   The farmer plants grain
1  265    The fisher catches tuna
2  456           The sky is blue
3  434         The sun is bright
4  921             I own a phone
5  921              I own a book
split('.') will quickly break strings into sentences if the sentences really are separated by periods and periods are not used for anything else (e.g. denoting abbreviations); it also drops the periods in the process. It will fail if periods serve multiple purposes or if not every sentence ends with one. A slower but much more robust approach is to use sent_tokenize, as you had asked, to split each row into sentences (see the short comparison after the next output):
from nltk import sent_tokenize

sentences = []
for row in df.itertuples():
    for sentence in sent_tokenize(row[2]):
        sentences.append((row[1], sentence))
new_df = pandas.DataFrame(sentences, columns=['ID', 'SENTENCE'])
This produces the following output:
    ID                   SENTENCE
0  265   The farmer plants grain.
1  265   The fisher catches tuna.
2  456           The sky is blue.
3  434         The sun is bright.
4  921             I own a phone.
5  921              I own a book.
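To see the robustness difference, here is a throwaway comparison (not from the original data) on a string containing an abbreviation; sent_tokenize needs the punkt model downloaded once:

from nltk import sent_tokenize

# nltk.download('punkt')  # needed once before sent_tokenize will run
text = "Dr. Smith owns a boat. The boat is red."
print(text.split('.'))      # ['Dr', ' Smith owns a boat', ' The boat is red', '']
print(sent_tokenize(text))  # typically: ['Dr. Smith owns a boat.', 'The boat is red.']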
If you want to quickly remove periods from these lines you could do something like:
new_df['SENTENCE_noperiods'] = new_df.SENTENCE.apply(lambda x: x.strip('.'))
Which would yield the same table with an additional SENTENCE_noperiods column containing each sentence without its trailing period.
You can also take the apply -> map approach (df is your original table):
df = df.join(df.TEXT.apply(sent_tokenize).rename('SENTENCES'))
Yielding the original table with an extra SENTENCES column, where each cell holds that row's list of sentences.
Continuing:
sentences = df.SENTENCES.apply(pandas.Series)
sentences.columns = ['sentence {}'.format(n + 1) for n in sentences.columns]
This yields columns sentence 1 and sentence 2, with NaN in sentence 2 for rows that contain only one sentence.
As our indices have not changed, we can join this back into our original table:
df = df.join(sentences)
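If the long, one-row-per-sentence format from the question is what you ultimately want, a minimal sketch using DataFrame.explode (available in pandas 0.25 and later) gets there straight from the SENTENCES column built above, keeping the original ID next to every sentence:

long_df = (df[['ID', 'SENTENCES']]
           .explode('SENTENCES')
           .rename(columns={'SENTENCES': 'SENTENCE'})
           .reset_index(drop=True))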
Word Tokenization
Continuing with df from above, we can extract the tokens in a given sentence as follows:
from nltk import word_tokenize

df['sent_1_words'] = df['sentence 1'].apply(word_tokenize)
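If you want tokens for every sentence column rather than just the first, keep in mind that rows with fewer sentences hold NaN in the later columns, which word_tokenize cannot handle. A small sketch that guards against that (assuming the wide columns are named 'sentence 1', 'sentence 2', ... as above):

sentence_cols = [c for c in df.columns if c.startswith('sentence ')]
for col in sentence_cols:
    # tokenize strings, leave an empty list where the row has no such sentence
    df[col + '_words'] = df[col].apply(lambda s: word_tokenize(s) if isinstance(s, str) else [])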
This is a little complicated: I apply sentence tokenization first, then go through each sentence, drop the words in the remove_words list, and strip punctuation from each remaining word.
import pandas as pd
from nltk import sent_tokenize
from string import punctuation

remove_words = ['the', 'an', 'a']

def remove_punctuation(chars):
    return ''.join([c for c in chars if c not in punctuation])

# example dataframe
df = pd.DataFrame([[265, "The farmer plants grain. The fisher catches tuna."],
                   [456, "The sky is blue."],
                   [434, "The sun is bright."],
                   [921, "I own a phone. I own a book."]], columns=['sent_id', 'text'])

df.loc[:, 'text_split'] = df.text.map(sent_tokenize)

sentences = []
for _, r in df.iterrows():
    for s in r.text_split:
        filtered_words = [remove_punctuation(w) for w in s.split() if w.lower() not in remove_words]
        # or using nltk.word_tokenize
        # filtered_words = [w for w in word_tokenize(s) if w.lower() not in remove_words and w not in punctuation]
        sentences.append({'sent_id': r.sent_id,
                          'text': s.strip('.'),
                          'words': filtered_words})
df_words = pd.DataFrame(sentences)
Output
+-------+--------------------+--------------------+
|sent_id| text| words|
+-------+--------------------+--------------------+
| 265|The farmer plants...|[farmer, plants, ...|
| 265|The fisher catche...|[fisher, catches,...|
| 456| The sky is blue| [sky, is, blue]|
| 434| The sun is bright| [sun, is, bright]|
| 921| I own a phone| [I, own, phone]|
| 921| I own a book| [I, own, book]|
+-------+--------------------+--------------------+
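If you later need the full original text next to each sentence, a small usage sketch is to merge df_words back onto the example df by sent_id (column names follow the example above; the overlapping 'text' column from df gets a suffix):

merged = df_words.merge(df[['sent_id', 'text']], on='sent_id',
                        how='left', suffixes=('', '_original'))
# 'text' is the sentence, 'text_original' is the full row text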
Related
Let's say I have a df like this:
ID  name_x  st   string
1   xx      us   Being unacquainted with the chief raccoon was harming his prospects for promotion
2   xy      us1  The overpass went under the highway and into a secret world
3   xz      us   He was 100% into fasting with her until he understood that meant he couldn't eat
4   xu      us2  Random words in front of other random words create a random sentence
5   xi      us1  All you need to do is pick up the pen and begin
Using Python and pandas, for column st I want to count the name_x values and then extract the top 3 keywords from string.
For example like this:
st   name_x_count  top1_word  top2_word  top3_word
us   2             word1      word2      word3
us1  2             word1      word2      word3
us2  1             word1      word2      word3
Is there any way to solve this task?
I would first use groupby() to concatenate the strings as you show, then use collections.Counter and its most_common, and finally assign the result back to the dataframe. I am using x.lower() because otherwise "He" and "he" would be counted as different words (you can always remove it if that is intended):
import collections
import pandas as pd

output = df.groupby('st').agg(
    name_x_count=pd.NamedAgg('name_x', 'count'),
    string=pd.NamedAgg('string', ' '.join))
After grouping we create the three keyword columns using collections.Counter():
output[['top1_word', 'top2_word', 'top3_word']] = pd.DataFrame(
    output['string'].map(lambda x: [w for w, _ in collections.Counter(x.lower().split()).most_common(3)]).tolist(),
    index=output.index)
output = output.drop(columns='string')
Output:
name_x_count top1_word top2_word top3_word
st
us 2 he with was
us1 2 the and overpass
us2 1 random words in
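The most common tokens above are mostly function words ('he', 'with', 'the'). If actual keywords are wanted instead (an assumption about the goal), one option is to drop NLTK's English stop words before counting; a sketch that recomputes the three columns from the original df:

import collections
import nltk

# nltk.download('stopwords')  # needed once
stop = set(nltk.corpus.stopwords.words('english'))

def top_words(text, n=3):
    # count only tokens that are not English stop words
    counts = collections.Counter(w for w in text.lower().split() if w not in stop)
    return [w for w, _ in counts.most_common(n)]

joined = df.groupby('st')['string'].apply(' '.join)
output[['top1_word', 'top2_word', 'top3_word']] = pd.DataFrame(
    joined.map(top_words).tolist(), index=joined.index)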
First, I added a space at the end of each string, as we will combine sentences while grouping. Then I consolidated the sentences after grouping by the st column.
df['string'] = df['string'] + ' '  # we will use the sum function; when combining sentences there should be spaces in between
dfx = df.groupby('st').agg({'name_x': 'count', 'string': 'sum'})  # group by st: count name_x and combine the strings
dfx = dfx.rename(columns={'name_x': 'name_x_count'})  # match the column name shown in the output
Then split each combined string into words, count their frequencies, and keep the 3 most common.
from collections import Counter
mask=dfx['string'].apply(lambda x: list(dict(Counter(x.split()).most_common()[:3]).keys()))
print(mask)
'''
st string
us ['with', 'was', 'he']
us1 ['the', 'and', 'The']
us2 ['words', 'random', 'Random']
'''
Finally, add these first 3 words as new columns.
dfx[['top1_word', 'top2_word', 'top3_word']] = pd.DataFrame(mask.tolist(), index=mask.index)
dfx = dfx.drop(columns='string')  # drop the helper column so only the requested columns remain
dfx
st name_x_count top1_word top2_word top3_word
us 2 with was he
us1 2 the and The
us2 1 words random Random
I want to derive bigrams and used the following code to do so:
from sklearn.feature_extraction.text import CountVectorizer

def create_vectorizer():
    return CountVectorizer(lowercase=False, stop_words=['a', 'an', 'the', 'The'], ngram_range=(1, 3))

reviews_english["Review Gast"] = reviews_english["Review Gast"].astype(str).str.lower()

res = [(x, i.split()[j + 1]) for i in reviews_english["Review Gast"]
       for j, x in enumerate(i.split()) if j < len(i.split()) - 1]
res
I got the following results:
However, I would like to get the bigrams per row rather than for the whole list.
How can I do this?
Thanks
You can use CountVectorizer and fit_transform per row. However, since it expects a corpus (a list of texts), you will have to wrap the row's string in a list containing that single string.
Sample
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame({
    'text': ["a cat on the table",
             "a dog under the table",
             "an apple over the tree"]
})

cv = CountVectorizer(analyzer='word', ngram_range=(2, 2))

bigrams = []
for txt in df["text"].astype(str).str.lower():
    cv.fit_transform([txt])
    bigrams.append(cv.get_feature_names())  # use get_feature_names_out() on scikit-learn >= 1.0
df['bigrams'] = bigrams
print(df)
output:
text bigrams
0 a cat on the table [cat on, on the, the table]
1 a dog under the table [dog under, the table, under the]
2 an apple over the tree [an apple, apple over, over the, the tree]
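Refitting a vectorizer on every row works, but if only the n-gram strings are needed (not the counts), a lighter sketch is to build the analyzer once and map it over the column; build_analyzer() returns the preprocessing/tokenization/n-gram callable without fitting any vocabulary:

from sklearn.feature_extraction.text import CountVectorizer

analyze = CountVectorizer(analyzer='word', ngram_range=(2, 2)).build_analyzer()
df['bigrams'] = df['text'].astype(str).map(analyze)  # the analyzer lowercases by itself

One difference from the loop above: the analyzer keeps the bigrams in text order and with duplicates, whereas get_feature_names() returns them sorted and de-duplicated.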
I am trying to produce a very simple Twitter sentiment analysis. I have so far been able to pre-process my tweets; however, I am greatly struggling to lemmatize within my data frame. This is my code so far:
import nltk
import pandas as pd
from nltk.corpus import stopwords # Importing Natural Language Toolkit
from nltk.stem import WordNetLemmatizer
df = pd.read_csv(r'/Users/sarfrazkhan/Desktop/amazon.csv') # Loading Amazon data set into code
df = df['x'].str.replace('http\S+|www.\S+', '', case=False) # Removing URL's from data set
df = df.str.replace(r'\<.*\>', '') # Removing noise contained in '< >' these parenthesis
df = df.str.replace('RT ', '', case=False) # Removing the phrase 'RT" from all strings
df = df.str.replace('#[^\s]+', '', case=False) # Removing '#' and the following twitter handle from strings
df = df.str.replace('[^\w\s]', ' ') # Removing any punctuation
df = df.str.replace('\r\n', ' ') # Removing '\r\n' which is present in some strings
df = df.str.replace('\d+', '').str.lower().str.strip() # Removing numbers, capitalisation and white space
df = df.apply(nltk.word_tokenize) # Tokenizing data set
stop = nltk.download('stopwords') # Downloading stop words
stop = set(stopwords.words('english')) # Selecting English stop words
df = df.apply(lambda x: [item for item in x if item not in stop]) # Removing stop words from each string
lemmatizer = WordNetLemmatizer()
lemma_words = [lemmatizer.lemmatize(w, pos='a') for w in df]
I am struggling to get my lemmatizer to work and am constantly met with errors, possibly because my dataset is in list form (which I am struggling to work around). The Excel file I am trying to process is simply a long list of tweets under the heading 'x'. You can see on line 6 of my code that I focus primarily on this column; however, I'm unsure if this is the correct way to do it!
My expected outcome would be a list of words which have been lemmatised correctly within their respective rows, to which I can then carry out a sentiment analysis.
These are the first few lines of my data frame before attempting the lemmatising process:
1 [swinging, pendulum, wall, clock, love, give, ...
2 [enter, via, gleam, l]
3 [screw, every, follow, gets, nude, dms, dm, pr...
4 [bishop, law, coming, soon, bishop, series, bo...
5 [adventures, bella, amp, emily, book, series, ...
6 [written, books, various, genres, amazon, kind...
7 [author, books, amwriting, fantasy, mystery, p...
8 [wonderful, mentor, recent, times, graham, kee...
9 [available, amazon, ebay, disabilities, hidden...
10 [screw, every, follow, gets, nude, dms, dm, pr...
Your code is trying to lemmatize an actual list, hence the error.
... for w in df -> here, w is each list, rather than each element of each list.
To get around this you could use pandas apply to pass each row to a function (assuming df is a pd.DataFrame and not a pd.Series; if it is a Series and the below doesn't work, try df = df.to_frame() first):
def df_lemmatize(row):
    lemmatizer = WordNetLemmatizer()
    row.at['lemma_words'] = [lemmatizer.lemmatize(w, pos='a') for w in row.x]
    return row

df = df.apply(df_lemmatize, axis=1)
df_lemmatize will iterate over each element in the list, lemmatize it and then add the new list to a new column lemma_words.
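An equivalent, slightly shorter route, if your data is still the pre-processed Series of token lists shown above, is to map a list comprehension over it directly. A sketch with a tiny stand-in Series (hypothetical data; pos='a' matches your original code):

import pandas as pd
from nltk.stem import WordNetLemmatizer

# nltk.download('wordnet')  # needed once for WordNetLemmatizer
tweets = pd.Series([['swinging', 'pendulum', 'wall', 'clock'],
                    ['written', 'books', 'various', 'genres']])

lemmatizer = WordNetLemmatizer()
lemma_words = tweets.apply(lambda tokens: [lemmatizer.lemmatize(w, pos='a') for w in tokens])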
I am trying to count the frequency of words in a dataframe column, titled df['MESSAGETEXT'], as shown below. The code (from Stack Overflow) that I am working with is below:
from collections import Counter
import pandas as pd
import nltk
import string
top_N = 50
stopwords = nltk.corpus.stopwords.words('english')
# RegEx for stopwords
RE_stopwords = r'\b(?:{})\b'.format('|'.join(stopwords))
# replace '|'-->' ' and drop all stopwords
words = (df['MESSAGETEXT']
         .str.lower()
         .replace([r'\|', RE_stopwords], [' ', ''], regex=True)
         .str.cat(sep=' ')
         .split()
         )

# generate DF out of Counter
rslt = pd.DataFrame(Counter(words).most_common(top_N),
                    columns=['Word', 'Frequency']).set_index('Word')
print(rslt)
# plot
rslt.plot.bar(rot=0, figsize=(16,10), width=0.8)
The word-frequency result is below and still contains punctuation such as semicolons and full stops.
Frequency
Word
' 89217
# 22231
london 20404
. 18271
- 13356
like 13153
! 10752
get 10501
& 10073
love 9720
; 9422
good 9168
one 8630
? 7943
day 7781
time 6956
know 6818
see 6811
u 6786
new 6553
think 6545
got 6330
go 6329
#london 5888
back 5801
great 5736
would 5611
x 5566
thanks 5553
people 5534
going 5464
need 5381
happy 5338
today 5040
still 4984
much 4883
thank 4766
want 4680
last 4664
well 4479
really 4444
lol 4376
please 4275
... 4210
de 4207
come 4120
even 4117
man 4094
best 4076
night 4047
I need to eliminate the following: ?, x, -, ... etc.
Without knowing more details about the contents of the dataframe df, I suggest the following brute-force method:
Throw away any character that is not a letter with .replace([r'[^a-zA-Z]'], [' '], regex=True), and in is_valid check that words are not stop words and are at least 3 characters long.
from collections import Counter
import pandas as pd
import nltk

top_N = 50
stopwords = set(nltk.corpus.stopwords.words('english'))

def is_valid(w):
    return len(w) > 2 and w not in stopwords

if __name__ == '__main__':
    data = {'MESSAGETEXT': [open("sampletext.txt").read()]}
    df = pd.DataFrame(data, columns=['MESSAGETEXT'])
    words = (df['MESSAGETEXT']
             .str.lower()
             .replace([r'[^a-zA-Z]'], [' '], regex=True)
             .str.cat(sep=' ')
             .split()
             )
    filtered_words = [w for w in words if is_valid(w)]
    most_common = Counter(filtered_words).most_common(top_N)
    rslt = pd.DataFrame(most_common, columns=['Word', 'Frequency']).set_index('Word')
    print(rslt)
Without modifying your code, you could simply add two lines at the end to remove the unwanted rows from the resulting DataFrame (assumed to be rslt), either:
remove_list = '=>?x-'  # unwanted symbols; note that 'Word' is the index of rslt, not a column
rslt = rslt[rslt.index.map(lambda w: w not in remove_list)]
or
remove_list = '=>?x-'  # unwanted symbols
rslt = rslt.drop(index=[w for w in rslt.index if w in remove_list])
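Alternatively, if the intent is simply to keep rows whose word is purely alphabetic (an assumption; this would also drop hashtags such as #london and anything containing digits or apostrophes), one line on the index does it:

rslt = rslt[rslt.index.str.isalpha()]  # 'Word' is the index after set_index('Word')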
I'm working with an xlsx file with pandas and I would like to add the word "bodypart" in a column if the preceding column contains a word in a predefined list of bodyparts.
Original Dataframe:
Sentence Type
my hand NaN
the fish NaN
Result Dataframe:
Sentence Type
my hand bodypart
the fish NaN
Nothing I've tried works. I feel I'm missing something very obvious. Here's my last (failed) attempt:
import pandas as pd
import numpy as np
bodyparts = ['lip ', 'lips ', 'foot ', 'feet ', 'heel ', 'heels ', 'hand ', 'hands ']
df = pd.read_excel(file)
for word in bodyparts:
    if word in df["Sentence"]: df["Type"] = df["Type"].replace(np.nan, "bodypart", regex=True)
I also tried this, with "NaN" and NaN as variants for the first argument of str.replace:
if word in df['Sentence']: df["Type"] = df["Type"].str.replace("", "bodypart")
Any help would be greatly appreciated!
You can create a regex to search on word boundaries and then use that as an argument to str.contains, e.g.:
import pandas as pd
import numpy as np
import re
bodyparts = ['lips?', 'foot', 'feet', 'heels?', 'hands?', 'legs?']
rx = re.compile('|'.join(r'\b{}\b'.format(el) for el in bodyparts))
df = pd.DataFrame({
    'Sentence': ['my hand', 'the fish', 'the rabbit leg', 'hand over', 'something', 'cabbage', 'slippage'],
    'Type': [np.nan] * 7
})
df.loc[df.Sentence.str.contains(rx), 'Type'] = 'bodypart'
Gives you:
Sentence Type
0 my hand bodypart
1 the fish NaN
2 the rabbit leg bodypart
3 hand over bodypart
4 something NaN
5 cabbage NaN
6 slippage NaN
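Note that this match is case sensitive, so 'Hand' would not be tagged. If case-insensitive matching is wanted (an assumption about the requirement), compile the pattern with re.IGNORECASE:

rx = re.compile('|'.join(r'\b{}\b'.format(el) for el in bodyparts), flags=re.IGNORECASE)
df.loc[df.Sentence.str.contains(rx), 'Type'] = 'bodypart'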
A dirty solution would involve checking the intersection of two sets: set A is your list of body parts, set B is the set of words in the sentence. Note that it is the intersection you want (symmetric_difference would be truthy for almost every row), and the entries in bodyparts must not carry trailing spaces:

df['Type'] = df['Sentence'].apply(
    lambda x: 'bodypart' if set(x.split()) & set(bodyparts) else None)
The simplest way:

bodyparts = {'lip', 'lips', 'foot', 'feet', 'heel', 'heels', 'hand', 'hands'}  # spaces discarded
df.loc[df.Sentence.str.split().map(lambda words: not bodyparts.isdisjoint(words)), 'Type'] = 'bodypart'

You must first discard the trailing spaces in bodyparts. The boolean mask selects the rows whose sentence contains one of the body parts, and Type is the column to set; .loc is the indexer which permits the modification.