Dynamically count occurrences of multiple words within lists - python

I'm trying to count the occurrences of multiple keywords within each phrase of a dataframe. This seems similar to other questions, but not quite the same.
Here we have a df and a list of lists containing keywords/topics:
df=pd.DataFrame({'phrases':['very expensive meal near city center','very good meal and waiters','nice restaurant near center and public transport']})
topics=[['expensive','city'],['good','waiters'],['center','transport']]
For each phrase, we want to count how many words match in each separate topic. So the first phrase should score 2 for the 1st topic, 0 for the 2nd topic and 1 for the 3rd topic, etc.
I've tried this but it does not work:
from collections import Counter

topnum = 0
for t in topics:
    counts = []
    topnum += 1
    results = Counter()
    for line in df['phrases']:
        for c in line.split(' '):
            results[c] = t.count(c)
        counts.append(sum(results.values()))
    df['topic_' + str(topnum)] = counts
I'm not sure what I'm doing wrong; ideally I would end up with a count of matching words for each topic/phrase combination, but instead the counts seem to repeat themselves:
                                            phrases  topic_1  topic_2  topic_3
0              very expensive meal near city center        2        0        0
1                        very good meal and waiters        2        2        0
2  nice restaurant near center and public transport        2        2        2
Many thanks to whoever can help me.
Best Wishes

Your code fails because results is a single Counter shared across all phrases within a topic: each new phrase overwrites its own per-word entries but keeps entries left over from earlier phrases, so the running sum carries forward and the counts appear to repeat. Here is a solution that defines a helper function called find_count and applies it to the dataframe with a lambda.
import pandas as pd

df = pd.DataFrame({'phrases': ['very expensive meal near city center',
                               'very good meal and waiters',
                               'nice restaurant near center and public transport']})
topics = [['expensive', 'city'], ['good', 'waiters'], ['center', 'transport']]

def find_count(row, topics_index):
    # Count how many words of the phrase appear in the given topic list
    count = 0
    word_list = row['phrases'].split()
    for word in word_list:
        if word in topics[topics_index]:
            count += 1
    return count

df['Topic 1'] = df.apply(lambda row: find_count(row, 0), axis=1)
df['Topic 2'] = df.apply(lambda row: find_count(row, 1), axis=1)
df['Topic 3'] = df.apply(lambda row: find_count(row, 2), axis=1)
print(df)
#Output
                                             phrases  Topic 1  Topic 2  Topic 3
0               very expensive meal near city center        2        0        1
1                         very good meal and waiters        0        2        0
2  nice restaurant near center and public transport        0        0        2
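As a possibly more concise variant (my sketch, not part of the original answer), the same counts can be computed with one list comprehension per topic, avoiding the helper function:

# Sketch: one comprehension per topic column (assumes the df/topics above)
for i, t in enumerate(topics, start=1):
    topic_set = set(t)
    # sum() over a generator counts each word of the phrase found in the topic
    df['Topic ' + str(i)] = [sum(w in topic_set for w in p.split()) for p in df['phrases']]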

Related

Python categorize data in Excel based on key words from another Excel sheet

I have two Excel sheets; one has four different categories with keywords listed under each. I am using Python to find the keywords in the review data and match them to a category. I have tried using pandas and dataframes to compare, but I get errors like "DataFrame objects are mutable, thus they cannot be hashed". I'm not sure if there is a better way, but I am new to Pandas.
Here is an example:
Category sheet:

Service    Experience
fast       bad
slow       easy

Data Sheet:

Review #   Location   Review
1          New York   "The service was fast!"
2          Texas      "Overall it was a bad experience for me"
For the examples above I would expect the following as a result.
I would expect review 1 to match the category Service because of the word "fast" and I would expect review 2 to match category Experience because of the word "bad". I do not expect the review to match every word in the category sheet, and it is fine if one review belongs to more than one category.
Here is my code; note I am using a simplified example. In the example below I am trying to find the review data that matches the Customer Service list of keywords.
import pandas as pd

# List of Categories
cat = pd.read_excel("Categories_List.xlsx")
# Data being used
data = pd.read_excel("Data.xlsx")

# Data Frame for review column
reviews = pd.DataFrame(data["reviews"])
# Data Frames for Categories
cs = pd.DataFrame(cat["Customer Service"])
be = pd.DataFrame(cat["Billing Experience"])
net = pd.DataFrame(cat["Network"])
out = pd.DataFrame(cat["Outcome"])

for i in reviews:
    if cs in reviews:
        print("True")
One approach would be to build a regular expression from the cat frame:
exp = '|'.join([rf'(?P<{col}>{"|".join(cat[col].dropna())})' for col in cat])
which produces:
(?P<Service>fast|slow)|(?P<Experience>bad|easy)
Alternatively, replace cat with a list of columns to test:
cols = ['Service']
exp = '|'.join([rf'(?P<{col}>{"|".join(cat[col].dropna())})' for col in cols])
which produces:
(?P<Service>fast|slow)
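One caveat (my addition, not from the original answer): a plain alternation also matches inside longer words, e.g. fast inside breakfast. If that matters, the keywords can be wrapped in word boundaries:

# Assumption: same cat frame as above; \b anchors each keyword at word boundaries
exp = '|'.join([rf'(?P<{col}>\b(?:{"|".join(cat[col].dropna())})\b)' for col in cat])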
Then, to get matches, use str.extractall, aggregate into a summary, and join the result back onto the reviews frame:
Aggregated into List:
reviews = reviews.join(
    reviews['Review'].str.extractall(exp).groupby(level=0).agg(
        lambda g: list(g.dropna()))
)
   Review #  Location                                  Review Service Experience
0         1  New York          The service was fast and easy!  [fast]     [easy]
1         2     Texas  Overall it was a bad experience for me      []      [bad]
Aggregated into String:
reviews = reviews.join(
    reviews['Review'].str.extractall(exp).groupby(level=0).agg(
        lambda g: ', '.join(g.dropna()))
)
   Review #  Location                                  Review Service Experience
0         1  New York          The service was fast and easy!    fast       easy
1         2     Texas  Overall it was a bad experience for me               bad
Alternatively, for an existence test, use any on level=0:
reviews = reviews.join(
    reviews['Review'].str.extractall(exp).any(level=0)
)
   Review #  Location                                  Review  Service  Experience
0         1  New York          The service was fast and easy!     True        True
1         2     Texas  Overall it was a bad experience for me    False        True
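A side note (my addition, not in the original answer): the level argument of any was deprecated and removed in pandas 2.0, so on current pandas the existence test can be written with groupby instead:

# Assumption: pandas >= 2.0, where DataFrame.any(level=...) no longer exists
reviews = reviews.join(
    reviews['Review'].str.extractall(exp).notna().groupby(level=0).any()
)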
Or iterate over the columns with str.contains:
cols = cat.columns
for col in cols:
    reviews[col] = reviews['Review'].str.contains('|'.join(cat[col].dropna()))
   Review #  Location                                  Review  Service  Experience
0         1  New York          The service was fast and easy!     True        True
1         2     Texas  Overall it was a bad experience for me    False        True

If text is contained in another dataframe then flag row with a binary designation

I'm working on mining survey data. I was able to flag the rows for certain keywords:
survey['Rude'] = survey['Comment Text'].str.contains('rude', na=False, regex=True).astype(int)
Now, I want to flag any rows containing names. I have another dataframe that contains common US names.
Here's what I thought would work, but it is not flagging any rows, and I have validated that names do exist in the 'Comment Text':
for row in survey:
    for word in survey['Comment Text']:
        survey['Name'] = 0
        if word in names['Name']:
            survey['Name'] = 1
You are not looping through the series correctly. for row in survey: loops through the column names in survey. for word in survey['Comment Text']: loops through the comment strings. survey['Name'] = 0 creates a column of all 0s. Also note that word in names['Name'] tests membership in the index of that series, not its values.
You could use set intersections and apply(), to avoid all the looping through rows:
survey = pd.DataFrame({'Comment_Text': ['Hi rcriii',
                                        'Hi yourself stranger',
                                        'say hi to Justin for me']})
names = pd.DataFrame({'Name': ['rcriii', 'Justin', 'Susan', 'murgatroyd']})

s2 = set(names['Name'])

def is_there_a_name(s):
    # Intersect the words of the comment with the set of names
    s1 = set(s.split())
    if len(s1.intersection(s2)) > 0:
        return 1
    else:
        return 0

survey['Name'] = survey['Comment_Text'].apply(is_there_a_name)
print(names)
print(survey)
         Name
0      rcriii
1      Justin
2       Susan
3  murgatroyd
              Comment_Text  Name
0                Hi rcriii     1
1     Hi yourself stranger     0
2  say hi to Justin for me     1
As a bonus, return len(s1.intersection(s2)) to get the number of matches per line.
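For completeness (my sketch, not part of the original answer), the flag can also be computed without apply by joining the names into a word-bounded regex for str.contains:

import re

# Sketch: assumes the list of names fits comfortably into one alternation pattern
pattern = r'\b(?:' + '|'.join(map(re.escape, names['Name'])) + r')\b'
survey['Name'] = survey['Comment_Text'].str.contains(pattern).astype(int)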

How to create a new column in pandas dataframe with different replacement of a part of the string in each row?

I have 3 different columns in different dataframes that look like this.
Column 1 has sentence templates, e.g. "He would like to [action] this week".
Column 2 has pairs of words, e.g. "exercise, swim".
The 3rd column has the type for the word pair, e.g. [action].
I assume there should be something similar to "melt" in R, but I'm not sure how to do the replacement.
I would like to create a new column/dataframe which will have all the possible options for each sentence template (one sentence per row):
He would like to exercise this week.
He would like to swim this week.
The number of templates is significantly lower than the number of words I have. There are several types of word pairs (action, description, object, etc).
#a simple example of what I would like to achieve
import pandas as pd
#input1
templates = pd.DataFrame(columns=list('AB'))
templates.loc[0] = [1,'He wants to [action] this week']
templates.loc[1] = [2,'She noticed a(n) [object] in the distance']
templates
#input 2
words = pd.DataFrame(columns=list('AB'))
words.loc[0] = ['exercise, swim', 'action']
words.loc[1] = ['bus, shop', 'object']
words
#output
result = pd.DataFrame(columns=list('AB'))
result.loc[0] = [1, 'He wants to exercise this week']
result.loc[1] = [2, 'He wants to swim this week']
result.loc[2] = [3, 'She noticed a(n) bus in the distance']
result.loc[3] = [4, 'She noticed a(n) shop in the distance']
result
First create new columns with Series.str.extract, using the words from words['B'], and then use Series.map to get the values for replacement:
import re

pat = '|'.join(r'\[{}\]'.format(re.escape(x)) for x in words['B'])
templates['matched'] = templates['B'].str.extract('(' + pat + ')', expand=False).fillna('')
templates['repl'] = (templates['matched'].map(words.set_index('B')['A']
                                              .rename(lambda x: '[' + x + ']'))).fillna('')
print(templates)
   A                                          B   matched            repl
0  1             He wants to [action] this week  [action]  exercise, swim
1  2  She noticed a(n) [object] in the distance  [object]       bus, shop
And then replace in a list comprehension:
z = zip(templates['B'], templates['repl'], templates['matched'])
result = pd.DataFrame({'B': [a.replace(c, y) for a, b, c in z for y in b.split(', ')]})
result.insert(0, 'A', result.index + 1)
print(result)
   A                                       B
0  1          He wants to exercise this week
1  2              He wants to swim this week
2  3    She noticed a(n) bus in the distance
3  4   She noticed a(n) shop in the distance
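As a sketch of an alternative (my addition, assuming pandas >= 0.25 for explode): build the replacement options as lists, then let explode create one row per option. The mapping/expand names are just illustrative:

# Hypothetical helper; uses the same templates/words frames as above
mapping = {'[' + t + ']': ws.split(', ') for ws, t in zip(words['A'], words['B'])}

def expand(sentence):
    # Replace the first matching placeholder with each of its options
    for placeholder, options in mapping.items():
        if placeholder in sentence:
            return [sentence.replace(placeholder, o) for o in options]
    return [sentence]

result = templates.assign(B=templates['B'].apply(expand)).explode('B')
result = result.reset_index(drop=True)
result['A'] = result.index + 1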

Count match in 2 pandas dataframes

I have 2 dataframes, each containing text as a list in each row. This one is called df:
                 Datum                      File File_type                                               Text
Datum
2000-01-27  2000-01-27  0864820040_000127_04.txt       _04  [business, date, jan, heineken, starts, integr..
and I have another one, df_lm, which looks like this:
List_type Words
0 LM_cnstrain. [abide, abiding, bound, bounded, commit, commi...
1 LM_litigius. [abovementioned, abrogate, abrogated, abrogate...
2 LM_modal_me. [can, frequently, generally, likely, often, ou...
3 LM_modal_st. [always, best, clearly, definitely, definitive...
4 LM_modal_wk. [almost, apparently, appeared, appearing, appe...
I want to create new columns in df where the matching words are counted; for example, how many of the words from df_lm.Words[0] appear in df.Text[0].
Note: df has ca. 500 rows and df_lm has 6, so I need to create 6 new columns in df so that the updated df looks somewhat like this:
Datum       ...  LM_cnstrain  LM_litigius  LM_modal_me  ...
2000-01-27  ...            5            3            4
2000-02-25  ...            7            1            0
I hope I was clear with my question.
Thanks in advance!
EDIT:
I have already done something similar by creating a list and looping over it, but as the lists in df_lm are very long this is not an option. The code looked like this:
result_list = []
for file in file_list:
    count_growth = 0
    for word in text.split():
        if word in growth:
            count_growth = count_growth + 1
    a = {'Growth': count_growth}
    result_list.append(a)
According to my comments, you can try something like this:
The code below has to run in a loop where the text column from the 1st df is matched against all 6 rows of the second, creating a column with the number of matches.
desc = df_lm.iloc[0, 1]
matches = df.text.isin(desc)
result = df.text[matches]
If this helps you, let me know; otherwise I will update/delete the answer.
So I've come to the following solution:
for file in file_list:
    count_lm_constraint = 0
    count_lm_litigious = 0
    count_lm_modal_me = 0
    for word in text.split():
        if word in df_lm.iloc[0, 1]:
            count_lm_constraint = count_lm_constraint + 1
        if word in df_lm.iloc[1, 1]:
            count_lm_litigious = count_lm_litigious + 1
        if word in df_lm.iloc[2, 1]:
            count_lm_modal_me = count_lm_modal_me + 1
    a = {"File": name, "Text": text, 'lm_constraint': count_lm_constraint, 'lm_litigious': count_lm_litigious, ...}
    result_list.append(a)
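A more general sketch (my addition, not from the thread): since df_lm already pairs a column name with a word list per row, all 6 columns can be built in one loop over df_lm, using sets for fast membership tests:

# Assumption: df['Text'] holds lists of tokens and df_lm['Words'] holds lists of words
for _, lm_row in df_lm.iterrows():
    word_set = set(lm_row['Words'])
    # One new column per lexicon, named after its List_type
    df[lm_row['List_type']] = df['Text'].apply(
        lambda tokens: sum(tok in word_set for tok in tokens))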

How to stack the sentences within a pandas dataframe, carrying over their reference?

I have a large pandas dataframe with a lot of documents:

   id                  text
1  doc1  Google i...
2  doc2  Amazon...
3  doc3  This was...
...
n  docN  nice camera...
How can I stack all the documents into sentences, carrying over their respective id?
   id                               text
1  doc1  Google is a great company.
2  doc1  It is in silicon valley.
3  doc1  Their search engine is the best
4  doc2  Amazon is a great store.
5  doc2  it is located in Seattle.
6  doc2  its new product is alexa.
7  doc2  its expensive.
8  doc3  This was a great product.
...
n  docN  nice camera I really liked it.
I tried to:
import nltk

def sentence(document):
    sentences = nltk.sent_tokenize(document.strip(' '))
    return sentences

df['sentence'] = df['text'].apply(sentence)
df.stack(level=0)
However, it did not work. Any idea how to stack the sentences, carrying over their id of provenance?
There is a solution to the problem that is similar to yours here: pandas: When cell contents are lists, create a row for each element in the list. Here's my interpretation of it with respect to your particular task:
df['sents'] = df['text'].apply(lambda x: nltk.sent_tokenize(x))
s = df.apply(lambda x: pd.Series(x['sents']), axis=1).stack().\
    reset_index(level=1, drop=True)
s.name = 'sents'
df = df.drop(['sents', 'text'], axis=1).join(s)
This iterates over each document with apply so that it can use nltk.sent_tokenize. Then it converts the sentences into their own columns using the Series constructor:
df1 = df['text'].apply(lambda x: pd.Series(nltk.sent_tokenize(x)))
df1.set_index(df['id']).stack()
Example with fake data
df = pd.DataFrame({'id': ['doc1', 'doc2'],
                   'text': ['This is a sentence. And another. And one more. cheers',
                            'here are more sentences. yipee. woop.']})
df1 = df['text'].apply(lambda x: pd.Series(nltk.sent_tokenize(x)))
df1.set_index(df['id']).stack().reset_index().drop('level_1', axis=1)
     id                         0
0  doc1       This is a sentence.
1  doc1              And another.
2  doc1             And one more.
3  doc1                    cheers
4  doc2  here are more sentences.
5  doc2                    yipee.
6  doc2                     woop.
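On newer pandas (my addition, assuming pandas >= 0.25 for explode), the list-to-rows step can be done directly:

# Sketch: tokenize into lists of sentences, then explode one sentence per row
df.assign(text=df['text'].apply(nltk.sent_tokenize)).explode('text').reset_index(drop=True)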
I think you would find this a lot easier if you kept your corpus out of pandas. Here is my solution; I fit it back into a pandas dataframe at the end. I think this is probably the most scalable approach.
def stack(one, two):
    # Split the document on periods and pair each non-empty sentence with its id
    sp = two.split(".")
    return [(one, a.strip()) for a in sp if len(a.strip()) > 0]

st = sum(map(stack, df['id'].tolist(), df['text'].tolist()), [])
df2 = pd.DataFrame(st)
df2.columns = ['id', 'text']
If you want to add a sentence id column, you can make a small tweak:
def stack(one, two):
    sp = two.split(".")
    # range() replaces the Python-2-only xrange; sentence ids start at 1
    return [(one, b, a.strip()) for a, b in zip(sp, range(1, len(sp) + 1)) if len(a.strip()) > 0]

st = sum(map(stack, df['id'].tolist(), df['text'].tolist()), [])
df2 = pd.DataFrame(st)
df2.columns = ['id', 'sentence_id', 'text']
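A slightly more idiomatic version of the same tweak (my sketch, not from the original answer) uses enumerate:

def stack(one, two):
    # enumerate(..., 1) yields (sentence_id, sentence) pairs starting at 1
    return [(one, i, s.strip()) for i, s in enumerate(two.split('.'), 1) if s.strip()]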
