Check frequency of keywords from a df in a text - python

I have a given text string:
text = """Alice has two apples and bananas. Apples are very healty."""
and a dataframe:
word
apples
bananas
company
I would like to add a column "frequency" that counts the occurrences of each word from column "word" in the text.
So the output should be as below:
word     frequency
apples   2
bananas  1
company  0

import pandas as pd
df = pd.DataFrame(['apples', 'bananas', 'company'], columns=['word'])
para = "Alice has two apples and bananas. Apples are very healthy.".lower()
df['frequency'] = df['word'].apply(lambda x : para.count(x.lower()))
      word  frequency
0   apples          2
1  bananas          1
2  company          0
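One caveat with this approach: str.count matches substrings, so 'apples' would also be counted inside a word like 'pineapples'. A minimal word-boundary variant, reusing para and df from above (the regex-based counting is my addition, not part of the original answer):
import re
# count whole-word matches only, so 'apples' will not match inside 'pineapples'
df['frequency'] = df['word'].apply(
    lambda w: len(re.findall(rf'\b{re.escape(w.lower())}\b', para)))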

Convert the text to lowercase and then use a regex to split it into a list of words.
Then loop through each row of the dataframe and use a lambda function to count how often each word occurs in the previously created list.
# Import and create the data
import pandas as pd
import re
text = """Alice has two apples and bananas. Apples are very healty."""
df = pd.DataFrame(data={'word':['apples','bananas','company']})
# Solution
words_list = re.findall(r'\w+', text.lower())
df['Frequency'] = df['word'].apply(lambda x: words_list.count(x))
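If the keyword column is long, list.count rescans words_list once per keyword; mapping through a collections.Counter does a single pass over the text instead. A sketch building on words_list from above (my variant, not part of the original answer; Series.map honors Counter's default of 0 for absent keys because Counter defines __missing__):
from collections import Counter
word_counts = Counter(words_list)
# absent words map to 0 rather than NaN, so 'company' gets 0
df['Frequency'] = df['word'].map(word_counts)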

Related

Search DataFrame column for words in list

I am trying to create a new DataFrame column that contains words that match between a list of keywords and strings in a df column...
import pandas as pd

data = {
    'Sandwich Opinions': ['Roast beef is overrated',
                          'Toasted bread is always best',
                          'Hot sandwiches are better than cold']
}
df = pd.DataFrame(data)
keywords = ['bread', 'bologna', 'toast', 'sandwich']
df['Matches'] = df.apply(lambda x: ' '.join([i for i in df['Sandwich Opinions'].str.split() if i in keywords]), axis=1)
This seems like it should do the job but it's getting stuck in endless processing.
import numpy as np

for kw in keywords:
    df[kw] = np.where(df['Sandwich Opinions'].str.contains(kw), 1, 0)

def add_contain_row(row):
    contains = []
    for kw in keywords:
        if row[kw] == 1:
            contains.append(kw)
    return contains

df['contains'] = df.apply(add_contain_row, axis=1)
# if you want to drop the temp columns
df.drop(columns=keywords, inplace=True)
Create a regex pattern from your list of words:
import re
pattern = fr"\b({'|'.join(re.escape(k) for k in keywords)})\b"
df['contains'] = df['Sandwich Opinions'].str.extract(pattern, flags=re.IGNORECASE)
Output:
>>> df
Sandwich Opinions contains
0 Roast beef is overrated NaN
1 Toasted bread is always best bread
2 Hot sandwiches are better than cold NaN
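Note that str.extract only returns the first match per row. If a sentence can contain several keywords, str.findall with the same pattern collects all of them; a sketch reusing pattern from above (the contains_all column name is my own):
# findall returns every captured keyword per row; join them into one string
df['contains_all'] = (df['Sandwich Opinions']
                      .str.findall(pattern, flags=re.IGNORECASE)
                      .str.join(' '))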

Group by dataframe using python and pandas

Let's say I have a df like this:
ID  name_x  st   string
1   xx      us   Being unacquainted with the chief raccoon was harming his prospects for promotion
2   xy      us1  The overpass went under the highway and into a secret world
3   xz      us   He was 100% into fasting with her until he understood that meant he couldn't eat
4   xu      us2  Random words in front of other random words create a random sentence
5   xi      us1  All you need to do is pick up the pen and begin
Using Python and pandas, I want to group by column st, count the name_x values, and then extract the top 3 keywords from string.
For example, like this:
st   name_x_count  top1_word  top2_word  top3_word
us   2             word1      word2      word3
us1  2             word1      word2      word3
us2  1             word1      word2      word3
Is there any way to solve this task?
I would first groupby() to concatenate the strings as you show, then use collections.Counter and its most_common, and finally assign the result back to the dataframe. I am using x.lower() because otherwise "He" and "he" would be counted as different words (you can remove it if that is intended):
import collections

output = df.groupby('st').agg(
    name_x_count=pd.NamedAgg('name_x', 'count'),
    string=pd.NamedAgg('string', ' '.join))
After grouping, we create the columns using collections.Counter():
output[['top1_word','top2_word','top3_word']] = output['string'].map(
    lambda x: [w for w, _ in collections.Counter(x.lower().split()).most_common(3)])
output = output.drop(columns='string')
Output:
name_x_count top1_word top2_word top3_word
st
us 2 he with was
us1 2 the and overpass
us2 1 random words in
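Since the desired output shows st as an ordinary column rather than as the index, one extra step (my addition to the answer above) restores that shape:
output = output.reset_index()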
First, I added a space at the end of each string, since the sentences will be concatenated while grouping. Then I consolidated the sentences after grouping by the st column.
df['string'] = df['string'] + ' '  # we will use the sum function; the sentences need spaces between them when combined
dfx = df.groupby('st').agg({'name_x': 'count', 'string': 'sum'})  # group by st: count name_x and combine strings
dfx = dfx.rename(columns={'name_x': 'name_x_count'})
Then split each combined string into words, compute their frequency distribution, and take the first 3 values.
from collections import Counter
mask=dfx['string'].apply(lambda x: list(dict(Counter(x.split()).most_common()[:3]).keys()))
print(mask)
'''
st
us     ['with', 'was', 'he']
us1    ['the', 'and', 'The']
us2    ['words', 'random', 'Random']
'''
Finally, add these first 3 words as new columns.
dfx[['top1_word','top2_word','top3_word']] = pd.DataFrame(mask.tolist(), index=mask.index)
dfx
st   name_x_count top1_word top2_word top3_word
us   2            with      was       he
us1  2            the       and       The
us2  1            words     random    Random
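If case should be ignored here too, lowercasing before counting merges 'The'/'the' and 'Random'/'random', as in the first answer; a minimal variant of the mask line under that assumption:
mask = dfx['string'].apply(
    lambda x: [w for w, _ in Counter(x.lower().split()).most_common(3)])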

If text is contained in another dataframe then flag row with a binary designation

I'm working on mining survey data. I was able to flag the rows for certain keywords:
survey['Rude'] = survey['Comment Text'].str.contains('rude', na=False, regex=True).astype(int)
Now, I want to flag any rows containing names. I have another dataframe that contains common US names.
Here's what I thought would work, but it is not flagging any rows, even though I have validated that names do exist in the 'Comment Text':
for row in survey:
    for word in survey['Comment Text']:
        survey['Name'] = 0
        if word in names['Name']:
            survey['Name'] = 1
You are not looping through the data correctly. for row in survey: loops through the column names of survey. for word in survey['Comment Text']: loops through the comment strings. survey['Name'] = 0 creates a column of all 0s.
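A tiny demo of the first point (my own illustration, with a hypothetical two-column frame):
import pandas as pd
demo = pd.DataFrame({'Comment Text': ['hi'], 'Rude': [0]})
for row in demo:
    print(row)  # prints the column names: 'Comment Text', then 'Rude'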
You could use set intersections and apply() to avoid looping through rows:
import pandas as pd

survey = pd.DataFrame({'Comment_Text': ['Hi rcriii',
                                        'Hi yourself stranger',
                                        'say hi to Justin for me']})
names = pd.DataFrame({'Name': ['rcriii', 'Justin', 'Susan', 'murgatroyd']})

s2 = set(names['Name'])

def is_there_a_name(s):
    s1 = set(s.split())
    if len(s1.intersection(s2)) > 0:
        return 1
    else:
        return 0

survey['Name'] = survey['Comment_Text'].apply(is_there_a_name)
print(names)
print(survey)
Name
0 rcriii
1 Justin
2 Susan
3 murgatroyd
Comment_Text Name
0 Hi rcriii 1
1 Hi yourself stranger 0
2 say hi to Justin for me 1
As a bonus, return len(s1.intersection(s2)) to get the number of matches per line.
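Under that bonus suggestion, the flag function collapses to a count (the Name_count column name is my own):
# count how many known names appear in each comment instead of a 0/1 flag
survey['Name_count'] = survey['Comment_Text'].apply(
    lambda s: len(set(s.split()).intersection(s2)))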

Text analysis: finding the most common word in a column using python

I have created a dataframe with just one column, containing the subject line.
df = activities.filter(['Subject'],axis=1)
df.shape
This returned this dataframe:
Subject
0 Call Out: Quadria Capital - May Lo, VP
1 Call Out: Revelstoke - Anthony Hayes (Sr Assoc...
2 Columbia Partners: WW Worked (Not Sure Will Ev...
3 Meeting, Sophie, CFO, CDC Investment
4 Prospecting
I then tried to analyse the text with this code:
import nltk
top_N = 50
txt = df.Subject.str.lower().str.replace(r'\|', ' ')
words = nltk.tokenize.word_tokenize(txt)
word_dist = nltk.FreqDist(words)
stopwords = nltk.corpus.stopwords.words('english')
words_except_stop_dist = nltk.FreqDist(w for w in words if w not in stopwords)
rslt = pd.DataFrame(word_dist.most_common(top_N), columns=['Word', 'Frequency'])
print(rslt)
The error message I get is: 'Series' object has no attribute 'Subject'
The error is being thrown because you have converted df to a Series in this line:
df = activities.filter(['Subject'],axis=1)
So when you say:
txt = df.Subject.str.lower().str.replace(r'\|', ' ')
df is the Series and does not have the attribute Subject. Try replacing with:
txt = df.str.lower().str.replace(r'\|', ' ')
Or alternatively, don't filter your DataFrame down to a single column beforehand, and then
txt = df.Subject.str.lower().str.replace(r'\|', ' ')
should work.
[UPDATE]
What I said above is incorrect; as pointed out, filter does not return a Series, but rather a DataFrame with a single column.
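A quick check of that correction (my own illustration, not from the original answer):
import pandas as pd
df = pd.DataFrame({'Subject': ['a'], 'Other': [1]})
print(type(df.filter(['Subject'], axis=1)))  # <class 'pandas.core.frame.DataFrame'>
print(type(df['Subject']))                   # <class 'pandas.core.series.Series'>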
Data:
Subject
"Call Out: Quadria Capital - May Lo, VP"
Call Out: Revelstoke - Anthony Hayes (Sr Assoc...
Columbia Partners: WW Worked (Not Sure Will Ev...
"Meeting, Sophie, CFO, CDC Investment"
Prospecting
# read in the data
df = pd.read_clipboard(sep=',')
Updated code:
Convert all words to lowercase and remove all non-alphanumeric characters.
txt = df.Subject.str.lower().str.replace(r'\|', ' ') creates a pandas.core.series.Series and will be replaced.
words = nltk.tokenize.word_tokenize(txt) throws a TypeError, because txt is a Series.
The following code tokenizes each row of the dataframe instead.
Tokenizing splits each string into a list; after this step, df has a tok column where each row is a list of words.
import nltk
import pandas as pd
top_N = 50
# replace all non-alphanumeric characters
df['sub_rep'] = df.Subject.str.lower().str.replace(r'\W', ' ', regex=True)
# tokenize
df['tok'] = df.sub_rep.apply(nltk.tokenize.word_tokenize)
To analyze all the words in the column, the individual row lists are combined into a single list, called words.
# all tokenized words to a list
words = df.tok.tolist() # this is a list of lists
words = [word for list_ in words for word in list_]
# frequency distribution
word_dist = nltk.FreqDist(words)
# remove stopwords
stopwords = nltk.corpus.stopwords.words('english')
words_except_stop_dist = nltk.FreqDist(w for w in words if w not in stopwords)
# output the results
rslt = pd.DataFrame(word_dist.most_common(top_N), columns=['Word', 'Frequency'])
Output rslt:
Word Frequency
call 2
out 2
quadria 1
capital 1
may 1
lo 1
vp 1
revelstoke 1
anthony 1
hayes 1
sr 1
assoc 1
columbia 1
partners 1
ww 1
worked 1
not 1
sure 1
will 1
ev 1
meeting 1
sophie 1
cfo 1
cdc 1
investment 1
prospecting 1
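As an aside, on pandas >= 0.25 the nested list flattening above can also be written with Series.explode (my own equivalent variant):
# each row of df.tok is a list of tokens; explode() yields one token per row
words = df['tok'].explode().tolist()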

Pandas keeps converting strings to int

I have the following code from this question Df groupby set comparison:
import pandas as pd
wordlist = pd.read_csv('data/example.txt', sep='\r', header=None, index_col=None, names=['word'])
wordlist = wordlist.drop_duplicates(keep='first')
# wordlist['word'] = wordlist['word'].astype(str)
wordlist['split'] = ''
wordlist['anagrams'] = ''
for index, row in wordlist.iterrows():
    row['split'] = list(row['word'])
anaglist = wordlist['anagrams'] = wordlist['word'].apply(lambda x: ''.join(sorted(list(x))))
wordlist['anagrams'] = anaglist
wordlist = wordlist.drop(['split'], axis=1)
wordlist = wordlist['anagrams'].drop_duplicates(keep='first')
print(wordlist)
print(wordlist.dtypes)
Some input in my example.txt file seems to be read as ints, particularly when the strings have different character lengths. I can't seem to force pandas to treat the data as strings using .astype(str).
What's going on?
First, to force a column to be read as strings, you can use the parameter dtype=str in read_csv, but that is only needed when numeric values must be explicitly converted. Since all of your values are strings, every value in the column is already converted to str implicitly.
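For reference, a sketch of the dtype override mentioned above (only needed when the file contains numeric-looking values):
import pandas as pd
# force every parsed column to str, even values that look numeric
wordlist = pd.read_csv('example.txt', sep='\r', header=None,
                       names=['word'], dtype=str)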
I changed your code a bit:
Setup:
import pandas as pd
from io import StringIO

temp = '''"acb"
"acb"
"bca"
"foo"
"oof"
"spaniel"'''
# after testing, replace StringIO(temp) with 'example.txt'
wordlist = pd.read_csv(StringIO(temp), sep="\r", index_col=None, names=['word'])
print (wordlist)
word
0 acb
1 acb
2 bca
3 foo
4 oof
5 spaniel
#first remove duplicates
wordlist = wordlist.drop_duplicates()
#create lists and join them
wordlist['anagrams'] = wordlist['word'].apply(lambda x: ''.join(sorted(list(x))))
print (wordlist)
word anagrams
0 acb abc
2 bca abc
3 foo foo
4 oof foo
5 spaniel aeilnps
#sort DataFrame by column anagrams
wordlist = wordlist.sort_values('anagrams')
#get first duplicated rows
wordlist1 = wordlist[wordlist['anagrams'].duplicated()]
print (wordlist1)
word anagrams
2 bca abc
4 oof foo
#get all duplicated rows
wordlist2 = wordlist[wordlist['anagrams'].duplicated(keep=False)]
print (wordlist2)
word anagrams
0 acb abc
2 bca abc
3 foo foo
4 oof foo
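As a follow-up, the duplicated rows can also be collected into explicit anagram groups (my own addition, building on wordlist2 from above):
# group together the words that share a sorted-letter signature
groups = wordlist2.groupby('anagrams')['word'].apply(list)
print(groups)
# abc    [acb, bca]
# foo    [foo, oof]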
