Search DataFrame column for words in list - python

I am trying to create a new DataFrame column that contains words that match between a list of keywords and strings in a df column...
data = {
'Sandwich Opinions':['Roast beef is overrated','Toasted bread is always best','Hot sandwiches are better than cold']
}
df = pd.DataFrame(data)
keywords = ['bread', 'bologna', 'toast', 'sandwich']
df['Matches'] = df.apply(lambda x: ' '.join([i for i in df['Sandwich Opinions'].str.split() if i in keywords]), axis=1)
This seems like it should do the job but it's getting stuck in endless processing.

import numpy as np

# Flag each keyword with a temporary 0/1 column
for kw in keywords:
    df[kw] = np.where(df['Sandwich Opinions'].str.contains(kw), 1, 0)

# Collect the flagged keywords for each row
def add_contain_row(row):
    contains = []
    for kw in keywords:
        if row[kw] == 1:
            contains.append(kw)
    return contains

df['contains'] = df.apply(add_contain_row, axis=1)

# if you want to drop the temp columns
df.drop(columns=keywords, inplace=True)

Create a regex pattern from your list of words:
import re
pattern = fr"\b({'|'.join(re.escape(k) for k in keywords)})\b"
df['contains'] = df['Sandwich Opinions'].str.extract(pattern, flags=re.IGNORECASE)
Output:
>>> df
                     Sandwich Opinions contains
0              Roast beef is overrated      NaN
1         Toasted bread is always best    bread
2  Hot sandwiches are better than cold      NaN
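If you want every matching keyword in a row rather than just the first one that extract returns, a small variation (a sketch, reusing the pattern and df defined just above) is str.findall joined back into a string:
# All whole-word keyword matches per row, '' when nothing matches
df['Matches'] = (
    df['Sandwich Opinions']
      .str.findall(pattern, flags=re.IGNORECASE)
      .str.join(' ')
)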

Related

Find if words from one sentence are found in corresponding row of another column also containing sentences (Pandas)

I have a dataframe that looks like this:
                         email               account_name
0                          NaN  weichert, realtors mnsota
1  jhawkins sterling group com             sterling group
2        lbaltz baltzchevy com            baltz chevrolet
I have this code that works as a solution, but it takes forever on larger datasets, and I know there has to be an easier way. I'm just looking to see if anyone knows of a more concise/elegant way to find the count of matching words between corresponding rows of the two columns. Thanks.
test = prod_nb_wcomps_2.sample(3, random_state=10).reset_index(drop=True)
test = test[['email', 'account_name']]
print(test)

lst = []
for i in test.index:
    if not isinstance(test['email'].iloc[i], float):
        for word in test['email'].iloc[i].split(' '):
            if not isinstance(test['account_name'].iloc[i], float):
                for word2 in test['account_name'].iloc[i].split(' '):
                    if word in word2:
                        lst.append({'index': i, 'bool_col': True})
                    else:
                        lst.append({'index': i, 'bool_col': False})

df_dct = pd.DataFrame(lst)
df_dct = df_dct.loc[df_dct['bool_col'] == True]
df_dct['number of matches_per_row'] = df_dct.groupby('index')['bool_col'].transform('size')
df_dct.set_index('index', inplace=True, drop=True)
df_dct.drop(['bool_col'], inplace=True, axis=1)
test_ = pd.merge(test, df_dct, left_index=True, right_index=True)
test_
the resulting dataframe test_ looks like this
This solves your query.
import pandas as pd

df = pd.DataFrame({'email': ['', 'jhawkins sterling group com', 'lbaltz baltzchevy com'],
                   'name': ['John', 'sterling group', 'Linda']})

# Count, row by row, how many words in 'email' also appear in 'name'
for index, row in df.iterrows():
    matches = sum([1 for x in row['email'].split() if x in row['name'].split()])
    df.loc[index, 'matches'] = matches
Output:
                         email            name  matches
0                                          John      0.0
1  jhawkins sterling group com  sterling group      2.0
2        lbaltz baltzchevy com           Linda      0.0
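If iterrows is too slow on larger frames, one option (a sketch, assuming both columns hold plain strings; NaN handling would need adjusting) is a plain list comprehension that counts the distinct words shared by the two columns in each row:
# Count distinct words that appear in both columns of each row
df['matches'] = [
    len(set(str(e).split()) & set(str(n).split()))
    for e, n in zip(df['email'], df['name'])
]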

check frequency of keyword in df in a text

I have a given text string:
text = """Alice has two apples and bananas. Apples are very healty."""
and a dataframe:
word
apples
bananas
company
I would like to add a column "frequency" which will count occurrences of each word in column "word" in the text.
So the output should be as below:
word     frequency
apples   2
bananas  1
company  0
import pandas as pd
df = pd.DataFrame(['apples', 'bananas', 'company'], columns=['word'])
para = "Alice has two apples and bananas. Apples are very healty.".lower()
df['frequency'] = df['word'].apply(lambda x : para.count(x.lower()))
word frequency
0 apples 2
1 bananas 1
2 company 0
Alternatively, convert the text to lowercase and use a regex to turn it into a list of words, so that only whole-word occurrences are counted (str.count above also matches substrings, e.g. 'company' inside 'accompany').
Then loop through each row and use a lambda function to count that word in the previously created list.
# Import and create the data
import pandas as pd
import re
text = """Alice has two apples and bananas. Apples are very healty."""
df = pd.DataFrame(data={'word':['apples','bananas','company']})
# Solution
words_list = re.findall(r'\w+', text.lower())
df['Frequency'] = df['word'].apply(lambda x: words_list.count(x))
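Along the same lines, a variant (a sketch, not from the original answer) is to count all the words once with collections.Counter and then map each keyword to its count:
from collections import Counter
import re

# Word -> count lookup built once from the text
word_counts = Counter(re.findall(r'\w+', text.lower()))
df['Frequency'] = df['word'].str.lower().map(word_counts).fillna(0).astype(int)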

Pandas dataframe: select list items in a column, then transform string on the items

One of the columns I'm importing into my dataframe is structured as a list. I need to pick out certain values from said list, transform the value and add it to one of two new columns in the dataframe. Before:
Name   Listed_Items
Tom    ["dr_md_coca_cola", "dr_od_water", "potatoes", "grass", "ot_other_stuff"]
Steve  ["dr_od_orange_juice", "potatoes", "grass", "ot_other_stuff", "dr_md_pepsi"]
Phil   ["dr_md_dr_pepper", "potatoes", "grass", "dr_od_coffee", "ot_other_stuff"]
From what I've read I can turn the column into a list
df["listed_items"] = df["listed_items"].apply(eval)
But then I cannot see how to find the list item that starts with dr_md, extract it, remove the leading dr_md, replace the underscores with spaces, capitalize the words, and add the result to a new MD column for that row, then do the same again for dr_od. There is only one item starting with dr_md and one starting with dr_od in each row. Desired output:
Name   MD         OD
Tom    Coca Cola  Water
Steve  Pepsi      Orange Juice
Phil   Dr Pepper  Coffee
What you need to do is write a function that does the processing and pass it into apply (or, in this case, map). Alternatively, you could expand your list column into multiple columns and process them afterwards, but that will only work if your lists are always in the same order (see pandas expand columns with list into multiple columns). Because you only have one input column, you can use map instead of apply.
def process_dr_md(l: list):
    for s in l:
        if s.startswith("dr_md_"):
            # You can process your string further here
            return s[6:]

def process_dr_od(l: list):
    for s in l:
        if s.startswith("dr_od_"):
            # You can process your string further here
            return s[6:]

df["listed_items"] = df["listed_items"].map(eval)
df["MD"] = df["listed_items"].map(process_dr_md)
df["OD"] = df["listed_items"].map(process_dr_od)
I hope that gets you on your way!
Use pivot_table
df = df.explode('Listed_Items')
df = df[df.Listed_Items.str.contains('dr_')]
df['Type'] = df['Listed_Items'].str.contains('dr_md').map({True: 'MD', False: 'OD'})
df.pivot_table(values='Listed_Items',
               columns='Type',
               index='Name',
               aggfunc='first')
Type MD OD
Name
Phil dr_md_dr_pepper dr_od_coffee
Steve dr_md_pepsi dr_od_orange_juice
Tom dr_md_coca_cola dr_od_water
From here it's just a matter of beautifying your dataset as you wish.
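For example, one possible cleanup (a sketch, not part of the original answer; out is just a name for the pivoted result) is to strip the dr_md_/dr_od_ prefixes and title-case the rest:
out = df.pivot_table(values='Listed_Items', columns='Type', index='Name', aggfunc='first')
# Remove the leading dr_md_/dr_od_, turn underscores into spaces, capitalize each word
out = out.apply(lambda col: col.str.replace(r'^dr_(md|od)_', '', regex=True)
                               .str.replace('_', ' ')
                               .str.title())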
I took a slightly different approach from the previous answers.
Given a df of the form:
Name Items
0 Tom [dr_md_coca_cola, dr_od_water, potatoes, grass...
1 Steve [dr_od_orange_juice, potatoes, grass, ot_other...
2 Phil [dr_md_dr_pepper, potatoes, grass, dr_od_coffe...
and making the following assumptions:
only one item in a list matches the target mask
the target mask always appears at the start of the entry string
I created the following function to parse the list:
import re

def parse_Items(tgt_mask: str, itmList: list) -> str:
    p = re.compile(tgt_mask)
    for itm in itmList:
        if p.match(itm):
            # Strip the matched prefix and replace underscores with spaces
            return itm[p.search(itm).span()[1]:].replace('_', ' ')
Then you can modify your original data frame with the following:
df['MD'] = [parse_Items('dr_md_', x) for x in df['Items'].to_list()]
df['OD'] = [parse_Items('dr_od_', x) for x in df['Items'].to_list()]
df.pop('Items')
This produces the following:
    Name         MD            OD
0    Tom  coca cola         water
1  Steve      pepsi  orange juice
2   Phil  dr pepper        coffee
I would normalize the data before putting it into a DataFrame:
import pandas as pd
from typing import Dict, List, Tuple

def clean_stuff(text: str):
    # Drop the 6-character prefix (e.g. 'dr_md_') and prettify the rest
    clean_text = text[6:].replace('_', ' ')
    return " ".join([
        word.capitalize()
        for word in clean_text.split(" ")
    ])

def get_md_od(stuffs: List[str]) -> Tuple[str, str]:
    md_od = [s for s in stuffs if s.startswith(('dr_md', 'dr_od'))]
    md_od = sorted(md_od)  # 'dr_md...' sorts before 'dr_od...'
    return clean_stuff(md_od[0]), clean_stuff(md_od[1])

dirty_stuffs = [{'Name': 'Tom',
                 'Listed_Items': ["dr_md_coca_cola",
                                  "dr_od_water",
                                  "potatoes",
                                  "grass",
                                  "ot_other_stuff"]},
                {'Name': 'Tom',
                 'Listed_Items': ["dr_md_coca_cola",
                                  "dr_od_water",
                                  "potatoes",
                                  "grass",
                                  "ot_other_stuff"]}
                ]

normalized_stuff: List[Dict[str, str]] = []
for stuff in dirty_stuffs:
    md, od = get_md_od(stuff['Listed_Items'])
    normalized_stuff.append({
        'Name': stuff['Name'],
        'MD': md,
        'OD': od,
    })

df = pd.DataFrame(normalized_stuff)
print(df)

Parse movie text data into a dataframe

I have some .txt data from a movie script that looks like this.
JOHN
Hi man. How are you?
TOM
A little hungry but okay.
JOHN
Let's get breakfast then
I'd like to parse out the text and create a dataframe with 2 columns: one for the person, e.g. JOHN and TOM, and a second column for the lines (the block of text below each name). The result would look like:
index | person | lines
0 | JOHN | "Hi man. How are you?"
1 | TOM | "A little hungry but okay."
2 | JOHN | "Let's get breakfast then"
I know I'm late to this party, but this will parse an entire script into a dictionary with character names as keys and their dialogue as values. All you need to do then is df = pd.DataFrame(final_dict.values(), columns=final_dict.keys()).
import re

# Grouped regex pattern to capture character and dialogue in a tuple
char_dialogue = re.compile(r"(?m)^\s*\b([A-Z]+)\b\s*\n(.*(?:\n.+)*)")

extract_dialogue = char_dialogue.findall(script)

final_dict = {}
for element in extract_dialogue:
    # Separating the character and dialogue from the tuple
    char = element[0]
    line = element[1]
    # If the char is already a key in the dictionary
    # and line is not empty, append the dialogue to the value list
    if char in final_dict:
        if line != '':
            final_dict[char].append(line)
    else:
        # Else add the character name to the dictionary keys with their first line
        # Drop any lower-case matches from group 0
        # Can adjust the len here if you have characters with fewer letters
        if char.isupper() and len(char) > 2:
            final_dict[char] = [line]

# Some final cleaning to drop empty dialogue
final_dict = {k: v for k, v in final_dict.items() if v != ['']}

# More filtering to return only main characters with more than 50
# lines of dialogue
final_dict = {k: v for k, v in final_dict.items() if len(v) > 50}
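If you then want the two-column person/lines frame from the question rather than one column per character, one way (a sketch, assuming script holds the full text and final_dict was built as above) is to flatten the dictionary before building the DataFrame:
import pandas as pd

# One (person, line) row per piece of dialogue
df = pd.DataFrame(
    [(char, line) for char, lines in final_dict.items() for line in lines],
    columns=['person', 'lines']
)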
If every piece of dialogue is on a single line, you can split the text into lines:
lines = text.split('\n')
Remove the surrounding spaces:
lines = [x.strip() for x in lines]
And use slicing [start:end:step] to create the DataFrame:
df = pd.DataFrame({
    'person': lines[0::3],
    'lines': lines[1::3]
})
Example:
text = ''' JOHN
Hi man. How are you?
TOM
A little hungry but okay.
JOHN
Let's get breakfast then'''
lines = text.split('\n')
lines = [x.strip() for x in lines]
import pandas as pd
df = pd.DataFrame({
    'person': lines[0::3],
    'lines': lines[1::3]
})
print(df)
Result:
person lines
0 JOHN Hi man. How are you?
1 TOM A little hungry but okay.
2 JOHN Let's get breakfast then
If a person may have text spanning several lines, i.e.
JOHN
Hi man.
How are you?
then it needs more splitting and stripping.
You can do it before creating the DataFrame:
text = ''' JOHN
Hi man.
How are you?
TOM
A little hungry but okay.
JOHN
Let's get breakfast then'''
data = []

parts = text.split('\n\n')
for part in parts:
    person, lines = part.split('\n', 1)
    person = person.strip()
    lines = "\n".join(x.strip() for x in lines.split('\n'))
    data.append([person, lines])

import pandas as pd
df = pd.DataFrame(data)
df.columns = ['person', 'lines']
print(df)
Or you can try to do it after creating the DataFrame:
text = ''' JOHN
Hi man.
How are you?
TOM
A little hungry but okay.
JOHN
Let's get breakfast then'''
lines = text.split('\n\n')
lines = [x.split('\n', 1) for x in lines]
import pandas as pd
df = pd.DataFrame(lines)
df.columns = ['person', 'lines']
df['person'] = df['person'].str.strip()
df['lines'] = df['lines'].apply(lambda txt: "\n".join(x.strip() for x in txt.split('\n')))
print(df)
Result:
person lines
0 JOHN Hi man.\nHow are you?
1 TOM A little hungry but okay.
2 JOHN Let's get breakfast then

Python/pandas - Add word in column based on words in another column

I'm working with an xlsx file with pandas and I would like to add the word "bodypart" in a column if the preceding column contains a word in a predefined list of bodyparts.
Original Dataframe:
Sentence Type
my hand NaN
the fish NaN
Result Dataframe:
Sentence Type
my hand bodypart
the fish NaN
Nothing I've tried works. I feel I'm missing something very obvious. Here's my last (failed) attempt:
import pandas as pd
import numpy as np
bodyparts = ['lip ', 'lips ', 'foot ', 'feet ', 'heel ', 'heels ', 'hand ', 'hands ']
df = pd.read_excel(file)
for word in bodyparts:
    if word in df["Sentence"]:
        df["Type"] = df["Type"].replace(np.nan, "bodypart", regex=True)
I also tried this, with "NaN" and NaN as variants for the first argument of str.replace:
if word in df['Sentence'] : df["Type"] = df["Type"].str.replace("", "bodypart")
Any help would be greatly appreciated!
You can create a regex to search on word boundaries and then use that as an argument to str.contains, e.g.:
import pandas as pd
import numpy as np
import re
bodyparts = ['lips?', 'foot', 'feet', 'heels?', 'hands?', 'legs?']
rx = re.compile('|'.join(r'\b{}\b'.format(el) for el in bodyparts))
df = pd.DataFrame({
    'Sentence': ['my hand', 'the fish', 'the rabbit leg', 'hand over', 'something', 'cabbage', 'slippage'],
    'Type': [np.nan] * 7
})
df.loc[df.Sentence.str.contains(rx), 'Type'] = 'bodypart'
Gives you:
Sentence Type
0 my hand bodypart
1 the fish NaN
2 the rabbit leg bodypart
3 hand over bodypart
4 something NaN
5 cabbage NaN
6 slippage NaN
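If you also want to record which body part matched, the same compiled pattern can be reused with str.extract (a sketch beyond the original question; the Matched_part column name is just an example):
# Wrap the alternation in a capture group so str.extract can return the match
df['Matched_part'] = df['Sentence'].str.extract(f'({rx.pattern})', expand=False)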
A dirty solution would involve checking the intersection of two sets:
set A is your set of body parts (with the trailing spaces stripped), set B is the set of words in the sentence.
df['Sentence'].apply(
    lambda x: 'bodypart' if set(x.split()).intersection(bodyparts) else None
)
The simplest way:
df.loc[df.Sentence.isin(bodyparts), 'Type'] = 'Bodypart'
First you must discard the spaces in bodyparts:
bodyparts = {'lip', 'lips', 'foot', 'feet', 'heel', 'heels', 'hand', 'hands'}
df.Sentence.isin(bodyparts) selects the matching rows, and Type is the column to set; .loc is the indexer which permits the modification.
