How to assign multiple categories based on a condition - python

Here are the categories, each with a list of words I'll be checking the rows against for a match:
fashion = ['bag','purse','pen']
general = ['knob','hanger','bottle','printing','tissues','book','tissue','holder','heart']
decor =['holder','decoration','candels','frame','paisley','bunting','decorations','party','candles','design','clock','sign','vintage','hanging','mirror','drawer','home','clusters','placements','willow','stickers','box']
kitchen = ['pantry','jam','cake','glass','bowl','napkins','kitchen','baking','jar','mug','cookie','bowl','placements','molds','coaster','placemats']
holiday = ['rabbit','ornament','christmas','trinket','party']
garden = ['lantern','hammok','garden','tree']
kids = ['children','doll','birdie','asstd','bank','soldiers','spaceboy','childs']
Here is my code. (I am checking sentences for keywords and assigning the row a category accordingly. I want to allow overlapping, so one row could have more than one category.)
#check if description row contains words from one of our category lists
df['category'] = np.select(
    [
        df['description'].str.contains('|'.join(fashion)),
        df['description'].str.contains('|'.join(general)),
        df['description'].str.contains('|'.join(decor)),
        df['description'].str.contains('|'.join(kitchen)),
        df['description'].str.contains('|'.join(holiday)),
        df['description'].str.contains('|'.join(garden)),
        df['description'].str.contains('|'.join(kids))
    ],
    ['fashion','general','decor','kitchen','holiday','garden','kids'],
    'Other'
)
Current Output:
index description category
0 children wine glass kids
1 candles decor
2 christmas tree holiday
3 bottle general
4 soldiers kids
5 bag fashion
Expected Output:
index description category
0 children wine glass kids, kitchen
1 candles decor
2 christmas tree holiday, garden
3 bottle general
4 soldiers kids
5 bag fashion
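np.select assigns only the label of the first condition that is True for each row, which is why the overlaps are lost. One sketch that reuses the same substring masks but collects every match per row (assuming the category lists above are defined; the join order follows the category order, so row 0 comes out as 'kitchen, kids'):
import pandas as pd

masks = {name: df['description'].str.contains('|'.join(words))
         for name, words in [('fashion', fashion), ('general', general),
                             ('decor', decor), ('kitchen', kitchen),
                             ('holiday', holiday), ('garden', garden),
                             ('kids', kids)]}
# join the names of every matching category per row; fall back to 'Other'
df['category'] = pd.DataFrame(masks).apply(
    lambda row: ', '.join(name for name, hit in row.items() if hit) or 'Other',
    axis=1)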

Here's an option using apply():
df = pd.DataFrame({'description': ['children wine glass',
                                   'candles',
                                   'christmas tree',
                                   'bottle',
                                   'soldiers',
                                   'bag']})
def categorize(desc):
    lst = []
    for w in desc.split(' '):
        if w in fashion:
            lst.append('fashion')
        if w in general:
            lst.append('general')
        if w in decor:
            lst.append('decor')
        if w in kitchen:
            lst.append('kitchen')
        if w in holiday:
            lst.append('holiday')
        if w in garden:
            lst.append('garden')
        if w in kids:
            lst.append('kids')
    return ', '.join(lst)
df.apply(lambda x: categorize(x.description), axis=1)
Output:
0 kids, kitchen
1 decor
2 holiday, garden
3 general
4 kids
5 fashion
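The snippet above returns the Series but never stores it; to keep the result, assign it back to a column. Since categorize only needs the description field, a plain Series.apply works too (a small usage sketch):
df['category'] = df['description'].apply(categorize)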

Here's how I would do it.
The comments above each line explain what each step does.
Steps:
1. Convert all the categories into key:value pairs, using each word in a category as the key and the category name as the value. This lets you look up a word and map it back to its category.
2. Split the description field into multiple columns using split(expand=True).
3. Match the values in each column against the dictionary. The result will be category names and NaNs.
4. Join these back into a single ', '-separated column, excluding the NaNs, to get the final result, then apply pd.unique() to remove duplicate categories.
The seven lines of code you need are:
dict_keys = ['fashion','general','decor','kitchen','holiday','garden','kids']
dict_cats = [fashion,general,decor,kitchen,holiday,garden,kids]
s_dict = {val:dict_keys[i] for i,cats in enumerate(dict_cats) for val in cats}
temp = df['description'].str.split(expand=True)
temp = temp.applymap(s_dict.get)
df['new_category'] = temp.apply(lambda x: ','.join(x[x.notnull()]), axis = 1)
df['new_category'] = df['new_category'].apply(lambda x: ', '.join(pd.unique(x.split(','))))
If you have more categories, just add them to dict_keys and dict_cats. Everything else stays the same.
The full code with comments begins here:
import pandas as pd
c = ['description','category']
d = [['children wine glass','kids'],
     ['candles','decor'],
     ['christmas tree','holiday'],
     ['bottle','general'],
     ['soldiers','kids'],
     ['bag','fashion']]
df = pd.DataFrame(d,columns = c)
fashion = ['bag','purse','pen']
general = ['knob','hanger','bottle','printing','tissues','book','tissue','holder','heart']
decor =['holder','decoration','candels','frame','paisley','bunting','decorations','party','candles','design','clock','sign','vintage','hanging','mirror','drawer','home','clusters','placements','willow','stickers','box']
kitchen = ['pantry','jam','cake','glass','bowl','napkins','kitchen','baking','jar','mug','cookie','bowl','placements','molds','coaster','placemats']
holiday = ['rabbit','ornament','christmas','trinket','party']
garden = ['lantern','hammok','garden','tree']
kids = ['children','doll','birdie','asstd','bank','soldiers','spaceboy','childs']
#create a list of all the lists
dict_keys = ['fashion','general','decor','kitchen','holiday','garden','kids']
dict_cats = [fashion,general,decor,kitchen,holiday,garden,kids]
#create a dictionary with words from the list as key and category as value
s_dict = {val:dict_keys[i] for i,cats in enumerate(dict_cats) for val in cats}
#create a temp dataframe with one word for each column using split
temp = df['description'].str.split(expand=True)
#match the words in each column against the dictionary
temp = temp.applymap(s_dict.get)
#Now put them back together and you have the final list
df['new_category'] = temp.apply(lambda x: ','.join(x[x.notnull()]), axis = 1)
#Remove duplicates using pd.unique()
#Note: prev line join modified to ',' from ', '
df['new_category'] = df['new_category'].apply(lambda x: ', '.join(pd.unique(x.split(','))))
print (df)
The output of this will be (I kept your category column and created a new one called new_category):
description category new_category
0 children wine glass kids kids, kitchen
1 candles decor decor
2 christmas tree holiday holiday, garden
3 bottle general general
4 soldiers kids kids
5 bag fashion fashion
The output including 'party candles holder' is:
description category new_category
0 children wine glass kids kids, kitchen
1 candles decor decor
2 christmas tree holiday holiday, garden
3 bottle general general
4 party candles holder None holiday, decor
5 soldiers kids kids
6 bag fashion fashion
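One behavioural note on this approach: because the description is split on whitespace and looked up as whole words, it will not match substrings the way the question's str.contains does. A quick check against the s_dict built above (hypothetical inputs):
print(s_dict.get('bags'))   # None - only the exact word 'bag' is a key
print(s_dict.get('bag'))    # fashion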

Related

str.contains not working when there is not a space between the word and special character

I have a dataframe which includes the titles of movies and TV series.
From specific keywords I want to classify each row as a Movie or a Series. However, because the brackets leave no space between them and the keywords, the keywords are not picked up by the str.contains() function, and I need a workaround.
This is my dataframe:
import pandas as pd
import numpy as np
watched_df = pd.DataFrame([['Love Death Robots (Episode 1)'],
                           ['James Bond'],
                           ['How I met your Mother (Avnsitt 3)'],
                           ['random name'],
                           ['Random movie 3 Episode 8383893']],
                          columns=['Title'])
watched_df.head()
To add the column that classifies the titles as TV series or Movies I have the following code.
watched_df["temporary_brackets_removed_title"] = watched_df['Title'].str.replace('(', '')
watched_df["Film_Type"] = np.where(watched_df.temporary_brackets_removed_title.astype(str).str.contains(pat = 'Episode | Avnsitt', case = False), 'Series', 'Movie')
watched_df = watched_df.drop('temporary_brackets_removed_title', 1)
watched_df.head()
Is there a simpler way to solve this without having to add and drop a column?
Maybe a str.contains-like function that checks whether a string merely contains the given word rather than matching it exactly, similar to the LIKE functionality in SQL?
You can use str.contains and then map the results:
watched_df['Film_Type'] = watched_df['Title'].str.contains(r'(?:Episode|Avnsitt)').map({True: 'Series', False: 'Movie'})
Output:
>>> watched_df
Title Film_Type
0 Love Death Robots (Episode 1) Series
1 James Bond Movie
2 How I met your Mother (Avnsitt 3) Series
3 random name Movie
4 Random movie 3 Episode 8383893 Series
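If case-insensitive matching is also wanted, as in the original attempt, str.contains accepts a case keyword (a small variant of the line above; not strictly needed for this sample data):
watched_df['Film_Type'] = (
    watched_df['Title']
    .str.contains(r'(?:Episode|Avnsitt)', case=False)
    .map({True: 'Series', False: 'Movie'})
)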

How to create a column of strings, including the values from another column

df =
car
big.yellow
small.red
small.black
I want to insert each row's value between two fixed strings. Desired output:
vehicle = 'The vehicle is big.yellow mine'
vehicle = 'The vehicle is small.red mine'
vehicle = 'The vehicle is small.black mine'
I need to merge all these strings into one big string:
final_vehicle = 'The vehicle is big.yellow mine
The vehicle is small.red mine
The vehicle is small.black mine'
But the number of rows in the real data is 1000+. How can I speed this up?
How to add string to all values in a column of pandas DataFrame answers the first question, but not the second.
A vectorized approach to create a string for each row value is:
df['col'] = 'string ' + df.car + ' string'
Combine the values into a single long string with one of the following:
- pandas.DataFrame.to_string: final = df.veh.to_string(index=False)
- str.join(): final = '\n'.join(df.veh.tolist())
import pandas as pd
import string # for test data
import random # for test data
# create test dataframe
random.seed(365)
df = pd.DataFrame({'car': [random.choice(string.ascii_lowercase) for _ in range(10000)]})
# display(df.head())
car
v
j
w
y
e
# add the veh column as strings including the value from the car column
df['veh'] = 'The vehicle is ' + df.car + ' mine'
# display(df.head())
car veh
v The vehicle is v mine
j The vehicle is j mine
w The vehicle is w mine
y The vehicle is y mine
e The vehicle is e mine
# create a long string of all the values in veh
final = df.veh.to_string(index=False)
print(final)
The vehicle is v mine
The vehicle is j mine
The vehicle is w mine
The vehicle is y mine
The vehicle is e mine
...
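The str.join() option from the list above gives the same lines and avoids any to_string display quirks (a minimal equivalent sketch):
# create a long string of all the values in veh using join
final = '\n'.join(df.veh.tolist())
print(final)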
This code should solve the problem:
import pandas as pd
df = pd.DataFrame(columns=['id', 'car'])
df['car'] = ['big.yellow', 'small.red', 'small.black']
df['id'] = [1,1,1]
df['new'] = df.groupby('id')['car'].apply(lambda x: ('The vehicle is '+x + '\n').cumsum().str.strip())
df
Results:
id car new
0 1 big.yellow The vehicle is big.yellow
1 1 small.red The vehicle is big.yellow\nThe vehicle is smal...
2 1 small.black The vehicle is big.yellow\nThe vehicle is smal...
and the final string:
df['new'][len(df)-1]
is:
'The vehicle is big.yellow\nThe vehicle is small.red\nThe vehicle is small.black'
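If only that final combined string is needed, the groupby/cumsum machinery can be skipped entirely (a sketch over the same df; joining the vectorized Series directly):
final = '\n'.join('The vehicle is ' + df['car'] + ' mine')
print(final)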

Pandas map two dataframes using regex

I have two dataframes, one with text information and another with regexes and patterns. What I need to do is map a column from the second dataframe using the regexes.
Edit: I need to apply each regex to all df['text'] rows and, if there is a match, add the Pattern to a new column.
Sample data
text_dict = {'text': ['customer and increased repair and remodel activity as well as from other sales',
                      'sales for the overseas customers',
                      'marketing approach is driving strong play from top tier customers',
                      'employees in India have been the continuance of remote work will impact productivity',
                      'sales due to higher customer']}
regex_dict = {'Pattern': ['Sales + customer', 'Marketing + customer', 'Employee * Productivity'],
              'regex': ['(?:sales\\w*)(?:[^,.?])*(?:customer\\w*)|(?:customer\\w*)(?:[^,.?])*(?:sales\\w*)',
                        '(?:marketing\\w*)(?:[^,.?])*(?:customer\\w*)|(?:customer\\w*)(?:[^,.?])*(?:marketing\\w*)',
                        '(?:employee\\w*)(?:[^\n])*(?:productivity\\w*)|(?:productivity\\w*)(?:[^\n])*(?:employee\\w*)']}
df
text
0 customer and increased repair and remodel acti...
1 sales for the overseas customers
2 marketing approach is driving strong play from...
3 employees in India have been the continuance o...
4 sales due to higher customer
regex
Pattern regex
0 Sales + customer (?:sales\w*)(?:[^,.?])*(?:customer\w*)|(?:cust...
1 Marketing + customer (?:marketing\w*)(?:[^,.?])*(?:customer\w*)|(?:...
2 Employee * Productivity (?:employee\w*)(?:[^\n])*(?:productivity\w*)|(...
Desired output
text Pattern
0 customer and increased repair and remodel acti... Sales + customer
1 sales for the overseas customers Sales + customer
2 marketing approach is driving strong play from... Marketing + customer
3 employees in India have been the continuance o... Employee * Productivity
4 sales due to higher customer Sales + customer
I tried the following: I created a function that returns the Pattern when there is a match, and then I iterate over all the rows in the regex dataframe:
def finding_keywords(regex, match, keyword):
    if re.search(regex, match):
        return keyword
    else:
        pass

for index, row in regex.iterrows():
    df['Pattern'] = df['text'].apply(lambda x: finding_keywords(regex['regex'][index], x, regex['Pattern'][index]))
The problem with this is that every iteration erases the previous mappings, as you can see below. Since "I'm foo foo" was the last iteration, it is the only row left with a pattern:
text Pattern
0 foo None
1 bar None
2 foo foo I'm foo foo
3 foo bar None
4 bar bar None
One solution could be to run the iteration over the regex dataframe and then iterate over df; this way I avoid losing information. But I'm looking for a faster solution.
You can loop through the unique values of the regex dataframe, apply each one to the text of the df frame, and return the pattern in a new regex column. Then merge in the Pattern column and drop the regex column.
The key to my approach was to first create the column as NaN and then fillna on each iteration so the column didn't get overwritten.
import re
import numpy as np
import pandas as pd

srs = regex['regex'].unique()
df['regex'] = np.nan
for reg in srs:
    df['regex'] = df['regex'].fillna(
        df['text'].apply(lambda x: reg if re.search(reg, x) else np.nan))
df = pd.merge(df, regex, how='left', on='regex').drop('regex', axis=1)
df
Out[1]:
text Pattern
0 customer and increased repair and remodel acti... Sales + customer
1 sales for the overseas customers Sales + customer
2 marketing approach is driving strong play from... Marketing + customer
3 employees in India have been the continuance o... Employee * Productivity
4 sales due to higher customer Sales + customer
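An alternative sketch that avoids the temporary regex column: a small helper that returns the Pattern of the first matching regex per row (assuming the df and regex frames from the question; rows with no match get None):
import re

def match_pattern(text):
    # return the Pattern of the first regex that matches this text
    for pat, rx in zip(regex['Pattern'], regex['regex']):
        if re.search(rx, text):
            return pat
    return None

df['Pattern'] = df['text'].apply(match_pattern)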

Remove entries ending in 'X' from column

I have a column of film titles. Some of these titles include the release date of the film (e.g. 'Toy Story (1995)'), but some do not. I want to delete the entries which DO NOT have a date. I tried to do this by saying "If the last character is not ')', make the entire entry blank." I tried the following code - it didn't give me an error, but it didn't work either:
for i in df['title']:
    if i[-1] != ')':
        i = ''
For instance, a shortened dataframe might be:
df = pd.DataFrame({'title': ['Toy Story (1995)', 'The Matrix (1999)', 'Jumanji', 'Interstellar (2014)']})
If the date format is just a year in parentheses at the end of the movie title, then try:
import re
df = pd.DataFrame({'movie':['Toy Story (1995)','Toy Story (no date)','Oddyssey 2000', 'Fort 6600', 'The Matrix (1999)', 'Jumanji', 'Interstellar (2014)']})
df:
movie
0 Toy Story (1995)
1 Toy Story (no date)
2 Oddyssey 2000
3 Fort 6600
4 The Matrix (1999)
5 Jumanji
6 Interstellar (2014)
Using regular expression:
df[df.movie.apply(lambda x: bool(re.search(r'\([1-2][0-9]{3}\)$', x)))]
result:
movie
0 Toy Story (1995)
4 The Matrix (1999)
6 Interstellar (2014)
Numbers that are not years or are not in brackets will not be included in the result. I assumed the year must begin with 1 or 2.
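The same filter can be written without apply by using the vectorized str.contains (equivalent under the same assumption about years):
df[df.movie.str.contains(r'\([1-2][0-9]{3}\)$')]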
In your loop, i only stores a copy of the value; it isn't a reference to the underlying item, so assigning to it changes nothing. You can do the update with enumerate:
for index, element in enumerate(df['title']):
    if element[-1] != ')':
        df.loc[index, 'title'] = ''
It is because the variable i stores a copy of the data, not the original reference. So you should do:
for i in range(len(df['title'])):
    if df['title'][i][-1] != ')':
        df.loc[i, 'title'] = ''
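Both loops can also be replaced by a single vectorized assignment (a sketch using boolean indexing; assumes no missing titles, since str.endswith returns NaN for them):
# blank out every title that does not end with ')'
df.loc[~df['title'].str.endswith(')'), 'title'] = ''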

How to create a new column in pandas dataframe with different replacement of a part of the string in each row?

I have 3 different columns in different dataframes that look like this.
Column 1 has sentence templates, e.g. "He would like to [action] this week".
Column 2 has pairs of words, e.g. "exercise, swim".
The third column has the type for the word pair, e.g. [action].
I assume there should be something similar to "melt" in R, but I'm not sure how to do the replacement.
I would like to create a new column/dataframe which will have all the possible options for each sentence template (one sentence per row):
He would like to exercise this week.
He would like to swim this week.
The number of templates is significantly lower than the number of words I have. There are several types of word pairs (action, description, object, etc).
#a simple example of what I would like to achieve
import pandas as pd
#input1
templates = pd.DataFrame(columns=list('AB'))
templates.loc[0] = [1,'He wants to [action] this week']
templates.loc[1] = [2,'She noticed a(n) [object] in the distance']
templates
#input 2
words = pd.DataFrame(columns=list('AB'))
words.loc[0] = ['exercise, swim', 'action']
words.loc[1] = ['bus, shop', 'object']
words
#output
result = pd.DataFrame(columns=list('AB'))
result.loc[0] = [1, 'He wants to exercise this week']
result.loc[1] = [2, 'He wants to swim this week']
result.loc[2] = [3, 'She noticed a(n) bus in the distance']
result.loc[3] = [4, 'She noticed a(n) shop in the distance']
result
First create new columns with Series.str.extract using the words from words['B'], and then use Series.map to get the values for replacement:
pat = '|'.join(r"\[{}\]".format(re.escape(x)) for x in words['B'])
templates['matched'] = templates['B'].str.extract('('+ pat + ')', expand=False).fillna('')
templates['repl'] =(templates['matched'].map(words.set_index('B')['A']
.rename(lambda x: '[' + x + ']'))).fillna('')
print (templates)
A B matched repl
0 1 He wants to [action] this week [action] exercise, swim
1 2 She noticed a(n) [object] in the distance [object] bus, shop
And then replace in a list comprehension:
z = zip(templates['B'],templates['repl'], templates['matched'])
result = pd.DataFrame({'B':[a.replace(c, y) for a,b,c in z for y in b.split(', ')]})
result.insert(0, 'A', result.index + 1)
print (result)
A B
0 1 He wants to exercise this week
1 2 He wants to swim this week
2 3 She noticed a(n) bus in the distance
3 4 She noticed a(n) shop in the distance
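For newer pandas, the same expansion can also be written with merge and explode (a sketch assuming pandas >= 0.25 and the templates/words frames defined above; key and word are helper columns introduced here):
key = templates['B'].str.extract(r'(\[\w+\])', expand=False)
merged = templates.assign(key=key).merge(
    words.assign(key='[' + words['B'] + ']'), on='key', suffixes=('', '_w'))
merged['word'] = merged['A_w'].str.split(', ')
result = merged.explode('word')
result['B'] = [s.replace(k, w)
               for s, k, w in zip(result['B'], result['key'], result['word'])]
result = result[['B']].reset_index(drop=True)
result.insert(0, 'A', result.index + 1)
print(result)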
