Background
I have a dataset that looks like the following:
product_name price
Women's pant 20.00
Men's Shirt 30.00
Women's Dress 40.00
Blue Shirt 30.00
...
I am looking to create a new column called gender, which will contain the values Women, Men, or Unisex based on the string in product_name.
The desired result would look like this:
product_name price gender
Women's pant 20.00 women
Men's Shirt 30.00 men
Women's Dress 40.00 women
Blue Shirt 30.00 unisex
My Approach
I figured that I should first create a new column with a blank value for each row, then loop through each row in the dataframe, check the string in df['product_name'] to see whether it is a men's, women's, or unisex product, and fill in the gender value for that row.
Here is my code:
df['gender'] = ""
for product_name in df['product_name']:
    if 'women' in product_name.lower():
        df['gender'] = 'women'
    elif 'men' in product_name.lower():
        df['gender'] = 'men'
    else:
        df['gender'] = 'unisex'
However, I get the following result:
product_name price gender
Women's pant 20.00 men
Men's Shirt 30.00 men
Women's Dress 40.00 men
Blue Shirt 30.00 men
I would really appreciate some help here, as I am new to Python and the pandas library.
You could use a list comprehension with if/else to get your output:
df['gender'] = ['women' if 'women' in word
                else "men" if "men" in word
                else "unisex"
                for word in df.product_name.str.lower()]
df
product_name price gender
0 Women's pant 20.0 women
1 Men's Shirt 30.0 men
2 Women's Dress 40.0 women
3 Blue Shirt 30.0 unisex
Alternatively, you could use numpy select to achieve the same results:
cond1 = df.product_name.str.lower().str.contains("women")
cond2 = df.product_name.str.lower().str.contains("men")
condlist = [cond1, cond2]
choicelist = ["women", "men"]
df["gender"] = np.select(condlist, choicelist, default="unisex")
For string operations like this, plain Python iteration is usually much faster than the pandas/numpy string methods; you should time both on your own data, though.
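One way to run that test is sketched below. It times the list comprehension against np.select on a made-up frame (the four rows from the question repeated 2500 times, which is an assumption and not the asker's real data). The helper names with_comprehension and with_select are hypothetical. The timings depend on the machine and the pandas version, so no result is claimed here.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical sample data: the four rows from the question, repeated.
df = pd.DataFrame({
    "product_name": ["Women's pant", "Men's Shirt", "Women's Dress", "Blue Shirt"] * 2500,
    "price": [20.0, 30.0, 40.0, 30.0] * 2500,
})

def with_comprehension(frame):
    """Label each product name with a plain Python list comprehension."""
    return ["women" if "women" in word
            else "men" if "men" in word
            else "unisex"
            for word in frame["product_name"].str.lower()]

def with_select(frame):
    """Label each product name with vectorized conditions and np.select."""
    names = frame["product_name"].str.lower()
    conditions = [names.str.contains("women"), names.str.contains("men")]
    return np.select(conditions, ["women", "men"], default="unisex")

# Both approaches must agree before their timings are worth comparing.
same_result = list(with_select(df)) == with_comprehension(df)

t_comprehension = timeit.timeit(lambda: with_comprehension(df), number=5)
t_select = timeit.timeit(lambda: with_select(df), number=5)
print(f"same result: {same_result}")
print(f"comprehension: {t_comprehension:.4f}s, np.select: {t_select:.4f}s")
```

Changing the repeat factor gives a rough idea of how the two scale with the number of rows.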
Try turning your for statement into a function and using apply. Something like:
def label_gender(product_name):
    '''product_name is a str'''
    if 'women' in product_name.lower():
        return 'women'
    elif 'men' in product_name.lower():
        return 'men'
    else:
        return 'unisex'

df['gender'] = df.apply(lambda x: label_gender(x['product_name']), axis=1)
A good breakdown of using apply/lambda can be found here: https://towardsdatascience.com/apply-and-lambda-usage-in-pandas-b13a1ea037f7
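Since the function only needs the product name, a possible variation is to call apply on that single column (Series.apply) instead of on the whole frame with axis=1. The sketch below assumes the four sample rows from the question and reuses the label_gender logic.

```python
import pandas as pd

# Sample rows taken from the question.
df = pd.DataFrame({
    "product_name": ["Women's pant", "Men's Shirt", "Women's Dress", "Blue Shirt"],
    "price": [20.0, 30.0, 40.0, 30.0],
})

def label_gender(product_name):
    """product_name is a str"""
    name = product_name.lower()
    if "women" in name:  # checked first, because "men" is inside "women"
        return "women"
    if "men" in name:
        return "men"
    return "unisex"

# Series.apply passes each product name straight to the function,
# so no lambda and no axis argument are needed.
df["gender"] = df["product_name"].apply(label_gender)
print(df)
```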
You can also use np.where with Series.str.contains:
import numpy as np
df['gender'] = (
    np.where(df.product_name.str.contains("women", case=False), 'women',
             np.where(df.product_name.str.contains("men", case=False), "men", 'unisex'))
)
product_name price gender
0 Women's pant 20.0 women
1 Men's Shirt 30.0 men
2 Women's Dress 40.0 women
3 Blue Shirt 30.0 unisex
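All of these variants check "women" before "men", because "men" is a substring of "women". A sketch that does not depend on that ordering uses regex word boundaries with Series.str.extract. It assumes the gender word always appears as a separate word (for example "Women's"), which may not hold for every dataset.

```python
import pandas as pd

# Sample rows taken from the question.
df = pd.DataFrame({
    "product_name": ["Women's pant", "Men's Shirt", "Women's Dress", "Blue Shirt"],
    "price": [20.0, 30.0, 40.0, 30.0],
})

# \b is a word boundary, so "men" no longer matches inside "women".
# "men" is listed first on purpose to show that the order is irrelevant here.
pattern = r"\b(men|women)\b"

df["gender"] = (
    df["product_name"]
    .str.lower()
    .str.extract(pattern, expand=False)  # NaN when neither word is present
    .fillna("unisex")
)
print(df)
```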
Use np.where with .str.contains and a regex that captures the first word in the phrase:
# np.where(product_name contains Women or Men, first word in phrase, otherwise 'Unisex')
df['Gender'] = np.where(df.product_name.str.contains('Women|Men'),
                        df.product_name.str.split(r'(^[\w]+)').str[1], 'Unisex')
product_name price Gender
0 Women's pant 20.0 Women
1 Men's Shirt 30.0 Men
2 Women's Dress 40.0 Women
3 Blue Shirt 30.0 Unisex
Related
assets = [[['Ferrari', 'BMW', 'Suzuki'], ['Ducati', 'Honda']], [['Apple', 'Samsung', 'Oppo']]]
price = [[[853600, 462300, 118900], [96500, 16700]], [[1260, 750, 340]]]
I have a dataframe as follows:
Car      Bike    Phone
BMW      Ducati  Apple
Ferrari  Honda   Oppo
I am looking for code to get the Total_Cost, i.e. 462300 + 96500 + 1260 = 560060:
Car      Bike    Phone  Total Cost
BMW      Ducati  Apple  560060
Ferrari  Honda   Oppo   870640
I tried a for loop and it worked, but I would like a more advanced approach, if there is one.
Here is a possible solution:
df = pd.DataFrame({'Car': ['BMW', 'Ferrari'], 'Bike': ['Ducati', 'Honda'], 'Phone': ['Apple', 'Oppo']})
asset_price = {asset: price[a][b][c]
               for a, asset_list in enumerate(assets)
               for b, asset_sub_list in enumerate(asset_list)
               for c, asset in enumerate(asset_sub_list)
               }
df['Total_Cost'] = df.apply(lambda row: sum([asset_price[asset] for asset in row]), axis=1)
print(df)
Car Bike Phone Total_Cost
0 BMW Ducati Apple 560060
1 Ferrari Honda Oppo 870640
You can also use a numpy approach (import numpy as np), depending on your use case. But I would suggest the first approach, which is simpler and easier to understand.
df = pd.DataFrame({'Car': ['BMW', 'Ferrari'], 'Bike': ['Ducati', 'Honda'], 'Phone': ['Apple', 'Oppo']})
flat_assets = np.concatenate([np.concatenate(row) for row in assets])
flat_price = np.concatenate([np.concatenate(row) for row in price])
asset_dict = dict(zip(flat_assets, flat_price))
asset_prices = np.array([asset_dict[row] for row in df.values.flatten()
if row in asset_dict])
df['Total Cost'] = np.sum(asset_prices.reshape(-1, 3), axis=1)
print(df)
Car Bike Phone Total Cost
0 BMW Ducati Apple 560060
1 Ferrari Honda Oppo 870640
An alternative approach:
First build a dataframe df_price which maps prices onto the assets and the classification (Car, Bike, and Phone):
df_price = (
pd.DataFrame({"assets": assets, "price": price}).explode(["assets", "price"])
.assign(cols=["Car", "Bike", "Phone"]).explode(["assets", "price"])
)
Result:
assets price cols
0 Ferrari 853600 Car
0 BMW 462300 Car
0 Suzuki 118900 Car
0 Ducati 96500 Bike
0 Honda 16700 Bike
1 Apple 1260 Phone
1 Samsung 750 Phone
1 Oppo 340 Phone
(I have inserted the classification here due to the comment on the other answer: "... But if the nested lists of asset is having common name (say : Honda in place of Suzuki ) then Honda car and Honda Bike will take one price".)
Then join the prices onto the melted main dataframe df (.melt), pivot the result (.pivot, using the auxiliary column idx), sum up the prices in each row, and bring the result into shape.
res = (
df.melt(var_name="cols", value_name="assets", ignore_index=False)
.merge(df_price, on=["cols", "assets"])
.assign(idx=lambda df: df.groupby("cols").cumcount())
.pivot(index="idx", columns="cols")
.assign(total=lambda df: df.loc[:, "price"].sum(axis=1))
.loc[:, ["assets", "total"]]
.droplevel(0, axis=1).rename(columns={"": "Total_Costs"})
)
Result:
cols Bike Car Phone Total_Costs
idx
0 Ducati BMW Apple 560060.0
1 Honda Ferrari Oppo 870640.0
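The same concern about shared names (a Honda car and a Honda bike) can also be handled with a plain dictionary keyed on (category, asset) pairs. This is only a sketch: it assumes the innermost lists of assets and price appear in the order Car, Bike, Phone, and the names categories and price_lookup are made up for the illustration.

```python
import pandas as pd

assets = [[['Ferrari', 'BMW', 'Suzuki'], ['Ducati', 'Honda']], [['Apple', 'Samsung', 'Oppo']]]
price = [[[853600, 462300, 118900], [96500, 16700]], [[1260, 750, 340]]]
df = pd.DataFrame({'Car': ['BMW', 'Ferrari'], 'Bike': ['Ducati', 'Honda'], 'Phone': ['Apple', 'Oppo']})

# Assumption: the innermost lists are ordered Car, Bike, Phone.
categories = ['Car', 'Bike', 'Phone']

# Flatten one nesting level so each sub-list lines up with one category.
asset_groups = [group for block in assets for group in block]
price_groups = [group for block in price for group in block]

# Key on (category, asset) so a name shared by two categories keeps both prices.
price_lookup = {
    (category, asset): cost
    for category, names, costs in zip(categories, asset_groups, price_groups)
    for asset, cost in zip(names, costs)
}

# Look up each column's prices and add the resulting Series together.
df['Total_Cost'] = sum(
    df[category].map(lambda asset, c=category: price_lookup[(c, asset)])
    for category in categories
)
print(df)
```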
Background
I have a dataset that looks like the following:
product_title price
Women's Pant 20.00
Men's Shirt 30.00
Women's Dress 40.00
Blue 4" Shorts 30.00
Blue Shorts 35.00
Green 2" Shorts 30.00
I created a new column called gender which contains the values Women, Men, or Unisex based on the specified string in product_title.
The output looks like this:
product_title price gender
Women's Pant 20.00 women
Men's Shirt 30.00 men
Women's Dress 40.00 women
Blue 4" Shorts 30.00 women
Blue Shorts 35.00 unisex
Green 2" Shorts 30.00 women
Approach
I approached creating a new column by using if/else statements:
df['gender'] = ['women' if 'women' in word or 'blue 4"' in word or 'green 2"' in word
                else "men" if "men" in word
                else "unisex"
                for word in df.product_title.str.lower()]
Although this approach works, it becomes very long when I have a lot of conditions for labeling women vs men vs unisex. Is there a cleaner way to do this? Is there a way I can pass a list of strings instead of having a long chain of or conditions?
I would really appreciate help, as I am new to Python and the pandas library.
IIUC,
import numpy as np
s = df['product_title'].str.lower()
# check 'women' first, because 'men' is a substring of 'women'
df['gender'] = np.select([s.str.contains('women|blue 4"|green 2"'),
                          s.str.contains('men')],
                         ['women', 'men'],
                         default='unisex')
Here is another idea with str.extract and series.map
d = {'women':['women','blue 4"','green 2"'],'men':['men']}
d1 = {val:k for k,v in d.items() for val in v}
pat = '|'.join(d1.keys())
import re
df['gender'] = (df['product_title'].str.extract('('+pat+')',flags=re.I,expand=False)
.str.lower().map(d1).fillna('unisex'))
print(df)
product_title price gender
0 Women's Pant 20.0 women
1 Men's Shirt 30.0 men
2 Women's Dress 40.0 women
3 Blue 4" Shorts 30.0 women
4 Blue Shorts 35.0 unisex
5 Green 2" Shorts 30.0 women
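A related sketch that takes a list of strings per label, as asked in the question, is shown below. The helper name label_by_keywords and its default argument are made up for illustration. Each keyword is passed through re.escape so that characters such as " or + are matched literally, and labels are tried in dictionary order, so "women" has to come before "men".

```python
import re

import pandas as pd

# Sample rows taken from the question.
df = pd.DataFrame({
    "product_title": ["Women's Pant", "Men's Shirt", "Women's Dress",
                      'Blue 4" Shorts', "Blue Shorts", 'Green 2" Shorts'],
    "price": [20.0, 30.0, 40.0, 30.0, 35.0, 30.0],
})

# Labels are checked in dict order, so "women" must come before "men".
keywords = {
    "women": ["women", 'blue 4"', 'green 2"'],
    "men": ["men"],
}

def label_by_keywords(titles, keywords, default="unisex"):
    """Return one label per title; the first label with a matching keyword wins."""
    labels = pd.Series(default, index=titles.index, dtype=object)
    unlabeled = pd.Series(True, index=titles.index)
    for label, words in keywords.items():
        # re.escape keeps special regex characters in the keywords literal.
        pattern = "|".join(re.escape(word) for word in words)
        hit = titles.str.contains(pattern, case=False, na=False) & unlabeled
        labels[hit] = label
        unlabeled &= ~hit
    return labels

df["gender"] = label_by_keywords(df["product_title"], keywords)
print(df)
```

Adding a new rule then only means adding a string to one of the lists.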
You can try defining your own function and running it with apply.
First create the function, which you can change as you need:
def sex(title):
    '''
    look for specific values in the title and return the matching label
    '''
    title = title.lower()
    # keywords must be lowercase, because the title has been lowercased
    for word in ['women', 'blue 4"', 'green 2"']:
        if word in title:
            return 'women'
    if 'men' in title:
        return 'men'
    return 'unisex'
Then apply it to the column you need to check:
df['gender'] = df['product_title'].apply(sex)
Cheers!
EDIT 3:
After looking around and checking the numpy approach from #ansev, following #anky's comment, I found that my approach may be faster up to a certain point: I tested with 5000 rows and it was still faster, but the numpy approach was starting to catch up. So it really depends on how big your dataset is.
I will remove my earlier comments on speed, since I was initially testing only on this small frame. This is still a learning process, as you can see from my level.
I'm trying to make a function that prints the top 5 products and their prices, and the bottom 5 products and their prices of the product listings that contain words from a wordlist. I've tried making it like this -
def wordlist_top_costs(filename, wordlist):
    xlsfile = pd.ExcelFile(filename)
    dframe = xlsfile.parse('Sheet1')
    dframe['Product'].fillna('', inplace=True)
    dframe['Price'].fillna(0, inplace=True)
    price = {}
    for word in wordlist:
        mask = dframe.Product.str.contains(word, case=False, na=False)
        price[mask] = dframe.loc[mask, 'Price']
    top = sorted(Score.items(), key=operator.itemgetter(1), reverse=True)
    print("Top 10 product prices for: ", wordlist.name)
    for i in range(0, 5):
        print(top[i][0], " | ", t[i][1])
    bottom = sorted(Score.items(), key=operator.itemgetter(1), reverse=False)
    print("Bottom 10 product prices for: ", wordlist.name)
    for i in range(0, 5):
        print(top[i][0], " | ", t[i][1])
However, the above function throws an error at the line
price[mask] = dframe.loc[mask, 'Price in AUD']
that says:
TypeError: 'Series' objects are mutable, thus they cannot be hashed
Any help to correct/modify this appreciated. Thanks!
Edit -
For example:
wordlist - alu, co, vin
Product | Price
Aluminium Crown - 22.20
Coca Cola - 1.0
Brass Box - 28.75
Vincent Kettle - 12.00
Vinyl Stickers - 0.50
Doritos - 2.0
Colin's Hair Oil - 5.0
Vincent Chase Sunglasses - 75.40
American Tourister - $120.90
Output :
Top 3 Product Prices:
Vincent Chase Sunglasses - 75.40
Aluminium Crown - 22.20
Vincent Kettle - 12.0
Bottom 3 Product Prices:
Vinyl Stickers - 0.50
Coca Cola - 1.0
Colin's Hair Oil - 5.0
You can use nlargest and nsmallest:
#remove $ and convert column Price to floats
#regex=False treats '$' as a literal character, not as the regex end-of-string anchor
dframe['Price'] = dframe['Price'].str.replace('$', '', regex=False).astype(float)
#filter by regex - joined all values of list by |
wordlist = ['alu', 'co', 'vin']
pat = '|'.join(wordlist)
mask = dframe.Product.str.contains(pat, case=False, na=False)
dframe = dframe.loc[mask, ['Product','Price']]
top = dframe.nlargest(3, 'Price')
#top = dframe.sort_values('Price', ascending=False).head(3)
print (top)
Product Price
7 Vincent Chase Sunglasses 75.4
0 Aluminium Crown 22.2
3 Vincent Kettle 12.0
bottom = dframe.nsmallest(3, 'Price')
#bottom = dframe.sort_values('Price').head(3)
print (bottom)
Product Price
4 Vinyl Stickers 0.5
1 Coca Cola 1.0
6 Colin's Hair Oil 5.0
Setup:
dframe = pd.DataFrame({'Price': ['22.20', '1.0', '28.75', '12.00', '0.50', '2.0', '5.0', '75.40', '$120.90'], 'Product': ['Aluminium Crown', 'Coca Cola', 'Brass Box', 'Vincent Kettle', 'Vinyl Stickers', 'Doritos', "Colin's Hair Oil", 'Vincent Chase Sunglasses', 'American Tourister']}, columns=['Product','Price'])
print (dframe)
Product Price
0 Aluminium Crown 22.20
1 Coca Cola 1.0
2 Brass Box 28.75
3 Vincent Kettle 12.00
4 Vinyl Stickers 0.50
5 Doritos 2.0
6 Colin's Hair Oil 5.0
7 Vincent Chase Sunglasses 75.40
8 American Tourister $120.90
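The TypeError in the question comes from using mask, a boolean Series, as a dictionary key; dictionary keys must be hashable and a Series is not. The steps of this answer can be wrapped in a function along the following lines. This is a sketch under some assumptions: the function name wordlist_price_extremes is hypothetical, it takes a DataFrame instead of a filename, and the Price column is assumed to have been converted to floats already.

```python
import re

import pandas as pd

def wordlist_price_extremes(dframe, wordlist, n=3):
    """Return (top, bottom) n rows by Price among products matching any word."""
    # Escape each word so it is matched literally, then join with | (regex OR).
    pattern = "|".join(re.escape(word) for word in wordlist)
    mask = dframe["Product"].str.contains(pattern, case=False, na=False)
    matched = dframe.loc[mask, ["Product", "Price"]]
    return matched.nlargest(n, "Price"), matched.nsmallest(n, "Price")

# Sample rows from the question, with prices already cleaned to floats.
dframe = pd.DataFrame({
    "Product": ["Aluminium Crown", "Coca Cola", "Brass Box", "Vincent Kettle",
                "Vinyl Stickers", "Doritos", "Colin's Hair Oil",
                "Vincent Chase Sunglasses", "American Tourister"],
    "Price": [22.20, 1.0, 28.75, 12.00, 0.50, 2.0, 5.0, 75.40, 120.90],
})

top, bottom = wordlist_price_extremes(dframe, ["alu", "co", "vin"])
print("Top 3 product prices:")
print(top)
print("Bottom 3 product prices:")
print(bottom)
```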
I have a large string which I have to transfer into a data frame. For example the string is:
meals_string = """APPETIZERS Southern Fried Quail with
Greens,Huckleberries,Pecans & Blue Cheese 14.00 Park Avenue Cafe
Chopped Salad Goat Feta Cheese,Nigoise Olives,Marinated White [...]
ENTREES Horseradish Crusted Canadian Salmon,Potato Fritters, Marinated
Cucumbers,Chive Vinaigrette 27.00 Sautéed Prawns with Mushroom
Tortellini,Grilled Tomato Vinaigrette & Sweet Corn 29.50"""
meals = meals_string.splitlines()
This gives me the variable "meals" as a list, but I am stuck on how to convert the string into a dataframe with 3 columns: Category, Meal_name, and Price.
A relatively simple parser for your string can be built and then passed directly to pandas.DataFrame like:
Code:
def meal_string_parser(meal_string):
    category = ''
    meal = []
    price = 0
    for word in meal_string.split():
        if word:
            try:
                price = float(word)
                yield category, ' '.join(meal), price
                meal = []
            except ValueError:
                # this is not a number, so not a price
                if word.upper() == word and word.isalnum():
                    # found category
                    category = word
                else:
                    meal.append(word)
    if meal:
        yield category, ' '.join(meal), price
Test Code:
meals_string = """
APPETIZERS
Southern Fried Quail with Greens,Huckleberries,Pecans & Blue Cheese 14.00
Park Avenue Cafe Chopped Salad Goat Feta Cheese,Nigoise Olives,Marinated White 13.00
ENTREES
Horseradish Crusted Canadian Salmon,Potato Fritters, Marinated Cucumbers,Chive Vinaigrette 27.00
Sautéed Prawns with Mushroom Tortellini,Grilled Tomato Vinaigrette & Sweet Corn 29.50
"""
import pandas as pd
df = pd.DataFrame(meal_string_parser(meals_string),
columns='Category Meal_name Price'.split())
print(df)
Results:
Category Meal_name Price
0 APPETIZERS Southern Fried Quail with Greens,Huckleberries... 14.0
1 APPETIZERS Park Avenue Cafe Chopped Salad Goat Feta Chees... 13.0
2 ENTREES Horseradish Crusted Canadian Salmon,Potato Fri... 27.0
3 ENTREES Sautéed Prawns with Mushroom Tortellini,Grille... 29.5
I have a pandas.DataFrame with 2 columns that contain the type of alcohol (e.g. VODKA 80 PROOF, CANADIAN WHISKIES, SPICED RUM) and the number of bottles sold. I would like to first group it into less granular categories (e.g. WHISKEY, VODKA, RUM) and then sum all bottles sold per category.
My code does not allow me to isolate tags such as "VODKA"; instead it returns the frequency of categories such as "VODKA 80 PROOF".
In:
top_N = 10 # top 10 most used categories
word_dist = nltk.FreqDist(df['Category Name'])
print('All frequencies:')
print('=' * 60)
rslt = pd.DataFrame(word_dist.most_common(top_N),
columns=['Word', 'Frequency'])
print(rslt)
print('=' * 60)
df= df.groupby('Category Name')['Bottles Sold'].sum()
Out:
All frequencies:
============================================================
Word Frequency
0 VODKA 80 PROOF 35373
1 CANADIAN WHISKIES 27087
2 STRAIGHT BOURBON WHISKIES 15342
3 SPICED RUM 14631
4 VODKA FLAVORED 14001
5 TEQUILA 12109
6 BLENDED WHISKIES 11547
7 WHISKEY LIQUEUR 10902
8 IMPORTED VODKA 10668
9 PUERTO RICO & VIRGIN ISLANDS RUM 10062
============================================================
Any thoughts?
Have you considered adding categories of matched words? Something like:
Code:
categories = {'VODKA', 'WHISKIES', 'RUM', 'TEQUILA', 'LIQUEUR'}
df['category'] = df['product'].apply(lambda x:
    [c for c in categories if c in x][0])
Test Code:
data = [
['VODKA 80 PROOF', '35373'],
['CANADIAN WHISKIES', '27087'],
['STRAIGHT BOURBON WHISKIES', '15342'],
['SPICED RUM', '14631'],
['VODKA FLAVORED', '14001'],
['TEQUILA', '12109'],
['BLENDED WHISKIES', '11547'],
['WHISKEY LIQUEUR', '10902'],
['IMPORTED VODKA', '10668'],
['PUERTO RICO & VIRGIN ISLANDS RUM', '10062'],
]
df = pd.DataFrame(data, columns=['product', 'count'])
df['count'] = df['count'].astype(int)  # the counts above are given as strings
categories = {'VODKA', 'WHISKIES', 'RUM', 'TEQUILA', 'LIQUEUR'}
df['category'] = df['product'].apply(lambda x:
[c for c in categories if c in x][0])
print(df)
print(df.groupby('category')['count'].sum())
Results:
product count category
0 VODKA 80 PROOF 35373 VODKA
1 CANADIAN WHISKIES 27087 WHISKIES
2 STRAIGHT BOURBON WHISKIES 15342 WHISKIES
3 SPICED RUM 14631 RUM
4 VODKA FLAVORED 14001 VODKA
5 TEQUILA 12109 TEQUILA
6 BLENDED WHISKIES 11547 WHISKIES
7 WHISKEY LIQUEUR 10902 LIQUEUR
8 IMPORTED VODKA 10668 VODKA
9 PUERTO RICO & VIRGIN ISLANDS RUM 10062 RUM
category
LIQUEUR 10902
RUM 24693
TEQUILA 12109
VODKA 60042
WHISKIES 53976
Name: count, dtype: int32
Thanks for your input Stephen!
I've made a slight modification because your answer was raising a KeyError for me.
Here is my modification:
def search_brand(x):
    x = str(x)
    list_brand = ['VODKA', 'WHISKIES', 'RUM', 'TEQUILA', 'LIQUEUR', 'BRANDIES', 'COCKTAILS']
    for word in list_brand:
        if word in x:
            return word

df['Broad Category'] = df['Category Name'].apply(search_brand)
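Note that search_brand returns None when no word matches. A possible vectorized variation uses Series.str.extract with the same word list, fills unmatched rows with a placeholder, and then sums the bottles per broad category. The sample rows, the bottle counts, and the "OTHER" label below are made up for the illustration. One difference from search_brand: when several words match, str.extract keeps the one that occurs first in the string, not the one that comes first in the list.

```python
import pandas as pd

# Hypothetical sample rows; the bottle counts are invented for this sketch.
df = pd.DataFrame({
    "Category Name": ["VODKA 80 PROOF", "CANADIAN WHISKIES", "SPICED RUM",
                      "IMPORTED VODKA", "AMERICAN DRY GINS"],
    "Bottles Sold": [10, 20, 30, 40, 5],
})

list_brand = ['VODKA', 'WHISKIES', 'RUM', 'TEQUILA', 'LIQUEUR', 'BRANDIES', 'COCKTAILS']

# One capture group holding all words, joined with | (regex OR).
pattern = "(" + "|".join(list_brand) + ")"

df["Broad Category"] = (
    df["Category Name"].astype(str)
    .str.extract(pattern, expand=False)  # NaN when no word matches
    .fillna("OTHER")                     # placeholder label, an assumption
)

totals = df.groupby("Broad Category")["Bottles Sold"].sum()
print(totals)
```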