Count sub word frequency in pandas DataFrame - python

I have a pandas.DataFrame with two columns: the type of alcohol (e.g. VODKA 80 PROOF, CANADIAN WHISKIES, SPICED RUM) and the number of bottles sold. I would like to first group the types into less granular categories (e.g. WHISKEY, VODKA, RUM) and then sum the bottles sold per category.
My code does not let me isolate tags such as "VODKA"; instead it returns the frequency of full category names such as "VODKA 80 PROOF".
In:
import nltk
import pandas as pd
top_N = 10  # top 10 most used categories
word_dist = nltk.FreqDist(df['Category Name'])
print('All frequencies:')
print('=' * 60)
rslt = pd.DataFrame(word_dist.most_common(top_N),
                    columns=['Word', 'Frequency'])
print(rslt)
print('=' * 60)
df = df.groupby('Category Name')['Bottles Sold'].sum()
Out:
All frequencies:
============================================================
Word Frequency
0 VODKA 80 PROOF 35373
1 CANADIAN WHISKIES 27087
2 STRAIGHT BOURBON WHISKIES 15342
3 SPICED RUM 14631
4 VODKA FLAVORED 14001
5 TEQUILA 12109
6 BLENDED WHISKIES 11547
7 WHISKEY LIQUEUR 10902
8 IMPORTED VODKA 10668
9 PUERTO RICO & VIRGIN ISLANDS RUM 10062
============================================================
Any thoughts?

Have you considered adding categories of matched words? Something like:
Code:
categories = {'VODKA', 'WHISKIES', 'RUM', 'TEQUILA', 'LIQUEUR'}
df['category'] = df['product'].apply(lambda x:
    [c for c in categories if c in x][0])
Test Code:
data = [
['VODKA 80 PROOF', '35373'],
['CANADIAN WHISKIES', '27087'],
['STRAIGHT BOURBON WHISKIES', '15342'],
['SPICED RUM', '14631'],
['VODKA FLAVORED', '14001'],
['TEQUILA', '12109'],
['BLENDED WHISKIES', '11547'],
['WHISKEY LIQUEUR', '10902'],
['IMPORTED VODKA', '10668'],
['PUERTO RICO & VIRGIN ISLANDS RUM', '10062'],
]
df = pd.DataFrame(data, columns=['product', 'count'])
df['count'] = df['count'].astype(int)  # the counts above are stored as strings
categories = {'VODKA', 'WHISKIES', 'RUM', 'TEQUILA', 'LIQUEUR'}
df['category'] = df['product'].apply(lambda x:
    [c for c in categories if c in x][0])
print(df)
print(df.groupby('category')['count'].sum())
Results:
product count category
0 VODKA 80 PROOF 35373 VODKA
1 CANADIAN WHISKIES 27087 WHISKIES
2 STRAIGHT BOURBON WHISKIES 15342 WHISKIES
3 SPICED RUM 14631 RUM
4 VODKA FLAVORED 14001 VODKA
5 TEQUILA 12109 TEQUILA
6 BLENDED WHISKIES 11547 WHISKIES
7 WHISKEY LIQUEUR 10902 LIQUEUR
8 IMPORTED VODKA 10668 VODKA
9 PUERTO RICO & VIRGIN ISLANDS RUM 10062 RUM
category
LIQUEUR 10902
RUM 24693
TEQUILA 12109
VODKA 60042
WHISKIES 53976
Name: count, dtype: int32

Thanks for your input, Stephen!
I made a slight modification because your answer was raising a KeyError for me.
Here is my modification:
def search_brand(x):
    x = str(x)
    list_brand = ['VODKA', 'WHISKIES', 'RUM', 'TEQUILA', 'LIQUEUR', 'BRANDIES', 'COCKTAILS']
    for word in list_brand:
        if word in x:
            return word
    # falls through and returns None when no keyword matches
df['Broad Category'] = df['Category Name'].apply(search_brand)
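A vectorized variant of the keyword approach above, as a minimal sketch. It assumes the keyword list from the answers, builds a single regular expression from it, and labels rows that match no keyword as OTHER (a placeholder name chosen here, not something from the original posts). The sample rows are the ones from the test code above, renamed to the question's column names.

```python
import re

import pandas as pd

# Sample rows taken from the test code in the answer above.
data = [
    ('VODKA 80 PROOF', 35373),
    ('CANADIAN WHISKIES', 27087),
    ('STRAIGHT BOURBON WHISKIES', 15342),
    ('SPICED RUM', 14631),
    ('VODKA FLAVORED', 14001),
    ('TEQUILA', 12109),
    ('BLENDED WHISKIES', 11547),
    ('WHISKEY LIQUEUR', 10902),
    ('IMPORTED VODKA', 10668),
    ('PUERTO RICO & VIRGIN ISLANDS RUM', 10062),
]
df = pd.DataFrame(data, columns=['Category Name', 'Bottles Sold'])

# Keywords from the answers; extend the list to cover your own data.
keywords = ['VODKA', 'WHISKIES', 'RUM', 'TEQUILA', 'LIQUEUR']
pattern = '(' + '|'.join(re.escape(k) for k in keywords) + ')'

# str.extract returns the leftmost keyword found in each name, or NaN if
# none matches. 'OTHER' is a placeholder label (an assumption) so that
# unmatched rows stay visible instead of silently dropping out of groupby.
df['Broad Category'] = (
    df['Category Name'].str.extract(pattern, expand=False).fillna('OTHER')
)

totals = df.groupby('Broad Category')['Bottles Sold'].sum()
print(totals)
```

Because this is a plain substring match, a short keyword such as RUM would also match inside a longer word; wrapping the alternation in \b word boundaries is one way to tighten it if that matters for your data.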

Related

If column contains substring from list, create new column with removed substring from list

I'm trying to create a simplified name column. I have a brand name column and a list of strings, as shown below. If the brand name contains any string from the list, the simplified column should hold the brand name with the matched string removed. Brand names that do not contain any string from the list should be carried over to the simplified column unchanged.
l = ['co', 'ltd', 'company']
df:
Brand
Nike
Adidas co
Apple company
Intel
Google ltd
Walmart co
Burger King
Desired df:
Brand Simplified
Nike Nike
Adidas co Adidas
Apple company Apple
Intel Intel
Google ltd Google
Walmart co Walmart
Burger King Burger King
Thanks in advance! Any help is appreciated!!
How about using str.replace to remove the substrings and then stripping the leftover whitespace:
list_substring = ['ltd', 'company', 'co']  # 'company' must come before 'co' so the longer match is tried first
df['Simplified'] = df['Brand'].str.replace('|'.join(list_substring), '', regex=True).str.strip()
In [28]: df
Out[28]:
Brand
0 Nike
1 Adidas co
2 Apple company
3 Intel
4 Google ltd
5 Walmart co
6 Burger King
In [30]: df["Simplified"] = df.Brand.apply(lambda x: ' '.join(x.split()[:-1]) if x.split()[-1] in l else x)
In [31]: df
Out[31]:
Brand Simplified
0 Nike Nike
1 Adidas co Adidas
2 Apple company Apple
3 Intel Intel
4 Google ltd Google
5 Walmart co Walmart
6 Burger King Burger King
Using str.replace with a word-boundary regex
Ex:
import pandas as pd
l = ['co', 'ltd', 'company']
df = pd.DataFrame({'Brand': ['Nike', 'Adidas co', 'Apple company', 'Intel', 'Google ltd', 'Walmart co', 'Burger King']})
df['Simplified'] = df['Brand'].str.replace(r"\b(" + "|".join(l) + r")\b", "", regex=True).str.strip()
# or, to remove the words only at the end of the string:
# df['Brand'].str.replace(r"\b(" + "|".join(l) + r")\b$", "", regex=True).str.strip()
print(df)
Output:
Brand Simplified
0 Nike Nike
1 Adidas co Adidas
2 Apple company Apple
3 Intel Intel
4 Google ltd Google
5 Walmart co Walmart
6 Burger King Burger King
df = {"Brand":["Nike","Adidas co","Apple company","Google ltd","Berger King"]}
df = pd.DataFrame(df)
list_items = ['ltd', 'company', 'co'] # 'company' will be evaluated first before 'co'
df['Simplified'] = [' '.join(w) for w in df['Brand'].str.split().apply(lambda x: [i for i in x if i not in list_items])]
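The answers above can be combined into one pattern that only strips a listed word when it is the last word of the name and that ignores case. This is a sketch under assumptions: the 'Google Ltd' capitalization and the 'Costco' row are hypothetical additions, included only to show that mixed case is handled and that 'co' inside a word is left alone.

```python
import re

import pandas as pd

# 'Google Ltd' (capitalized) and 'Costco' are hypothetical rows for illustration.
df = pd.DataFrame({'Brand': ['Nike', 'Adidas co', 'Apple company', 'Intel',
                             'Google Ltd', 'Walmart co', 'Burger King', 'Costco']})
suffixes = ['co', 'ltd', 'company']

# Match a listed word only when it is the final word of the name:
# at least one whitespace character, the word, then end of string.
pattern = r'\s+(?:' + '|'.join(re.escape(s) for s in suffixes) + r')$'

df['Simplified'] = df['Brand'].str.replace(
    pattern, '', flags=re.IGNORECASE, regex=True
)
print(df)
```

Requiring whitespace before the suffix is what keeps 'Costco' intact, and anchoring on the end of the string means the order of the words in the list no longer matters.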

Populate value for data frame row based on condition

Background
I have a dataset that looks like the following:
product_name price
Women's pant 20.00
Men's Shirt 30.00
Women's Dress 40.00
Blue Shirt 30.00
...
I am looking to create a new column called gender, which will contain the values Women, Men, or Unisex based on the string in product_name.
The desired result would look like this:
product_name price gender
Women's pant 20.00 women
Men's Shirt 30.00 men
Women's Dress 40.00 women
Blue Shirt 30.00 unisex
My Approach
I figured that I should first create a new column with a blank value for each row, then loop through each row of the DataFrame, check the string in df['product_name'] to see whether it is men's, women's, or unisex, and fill in the respective gender value for that row.
Here is my code:
df['gender'] = ""
for product_name in df['product_name']:
    if 'women' in product_name.lower():
        df['gender'] = 'women'
    elif 'men' in product_name.lower():
        df['gender'] = 'men'
    else:
        df['gender'] = 'unisex'
However, I get the following result:
product_name price gender
Women's pant 20.00 men
Men's Shirt 30.00 men
Women's Dress 40.00 men
Blue Shirt 30.00 men
I would really appreciate some help here, as I am new to Python and the pandas library.
You could use a list comprehension with if/else to get your output:
df['gender'] = ['women' if 'women' in word
                else 'men' if 'men' in word
                else 'unisex'
                for word in df.product_name.str.lower()]
df
product_name price gender
0 Women's pant 20.0 women
1 Men's Shirt 30.0 men
2 Women's Dress 40.0 women
3 Blue Shirt 30.0 unisex
Alternatively, you could use numpy.select to achieve the same result:
import numpy as np
cond1 = df.product_name.str.lower().str.contains("women")
cond2 = df.product_name.str.lower().str.contains("men")
condlist = [cond1, cond2]  # order matters: "women" is checked before "men"
choicelist = ["women", "men"]
df["gender"] = np.select(condlist, choicelist, default="unisex")
For string columns, plain Python iteration is often faster than the vectorized string methods, but you should benchmark that on your own data.
Try turning your for statement into a function and using apply. Something like:
def label_gender(product_name):
    '''product_name is a str'''
    if 'women' in product_name.lower():
        return 'women'
    elif 'men' in product_name.lower():
        return 'men'
    else:
        return 'unisex'
df['gender'] = df.apply(lambda x: label_gender(x['product_name']), axis=1)
A good breakdown of using apply/lambda can be found here: https://towardsdatascience.com/apply-and-lambda-usage-in-pandas-b13a1ea037f7
You can also use np.where with Series.str.contains:
import numpy as np
df['gender'] = (
np.where(df.product_name.str.contains("women", case=False), 'women',
np.where(df.product_name.str.contains("men", case=False), "men", 'unisex'))
)
product_name price gender
0 Women's pant 20.0 women
1 Men's Shirt 30.0 men
2 Women's Dress 40.0 women
3 Blue Shirt 30.0 unisex
Use np.where with str.contains and a regex that captures the first word of the phrase:
# np.where(product_name contains Women or Men, first word of the phrase, otherwise 'Unisex')
df['Gender'] = np.where(df.product_name.str.contains('Women|Men'),
                        df.product_name.str.split(r'(^\w+)').str[1], 'Unisex')
product_name price Gender
0 Women's pant 20.0 Women
1 Men's Shirt 30.0 Men
2 Women's Dress 40.0 Women
3 Blue Shirt 30.0 Unisex
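For context on the original problem: the loop in the question assigns to df['gender'] as a whole on every iteration, so each pass overwrites the entire column and only the result for the last row survives. One more vectorized option is sketched below; it assumes the four sample rows from the question and uses a word boundary so that "men" is only recognized at the start of a word.

```python
import re

import pandas as pd

df = pd.DataFrame({
    'product_name': ["Women's pant", "Men's Shirt", "Women's Dress", "Blue Shirt"],
    'price': [20.0, 30.0, 40.0, 30.0],
})

# \b requires the match to start at a word boundary, so "men" inside
# "women" (or inside a word such as "garment") is not picked up.
matched = df['product_name'].str.extract(
    r'\b(women|men)', flags=re.IGNORECASE, expand=False
)

# Rows with no match come back as NaN and are labelled 'unisex'.
df['gender'] = matched.str.lower().fillna('unisex')
print(df)
```

The pattern still matches words that merely begin with "men" (for example "Mention"), so it is a starting point that may need tightening for real product names.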

Filter data where date is within +/-30 days of multiple given dates

I have a dataset where each observation has a Date. Then I have a list of events. I want to filter the dataset and keep observations only if the date is within +/- 30 days of an event. Also, I want to know which event it is closest to.
For example, the main dataset looks like:
Product Date
Chicken 2008-09-08
Pork 2008-08-22
Beef 2008-08-15
Rice 2008-07-22
Coke 2008-04-05
Cereal 2008-04-03
Apple 2008-04-02
Banana 2008-04-01
It is generated by
d = {'Product': ['Apple', 'Banana', 'Cereal', 'Coke', 'Rice', 'Beef', 'Pork', 'Chicken'],
'Date': ['2008-04-02', '2008-04-01', '2008-04-03', '2008-04-05',
'2008-07-22', '2008-08-15', '2008-08-22', '2008-09-08']}
df = pd.DataFrame(data = d)
df['Date'] = pd.to_datetime(df['Date'])
Then I have a column of events:
Date
2008-05-03
2008-07-20
2008-09-01
generated by
event = pd.DataFrame({'Date': pd.to_datetime(['2008-05-03', '2008-07-20', '2008-09-01'])})
GOAL (EDITED)
I want to keep the rows in df only if df['Date'] is within a month of event['Date']. For example, the first event occurred on 2008-05-03, so I want to keep observations between 2008-04-03 and 2008-06-03, and also create a new column to tell this observation is closest to the event on 2008-05-03.
Product Date Event
Chicken 2008-09-08 2008-09-01
Pork 2008-08-22 2008-09-01
Beef 2008-08-15 2008-07-20
Rice 2008-07-22 2008-07-20
Coke 2008-04-05 2008-05-03
Cereal 2008-04-03 2008-05-03
Use NumPy broadcasting (import numpy as np), assuming "within a month" means within 30 days:
df[np.any(np.abs(df.Date.values[:,None]-event.Date.values)/np.timedelta64(1,'D')<31,1)]
Out[90]:
Product Date
0 Chicken 2008-09-08
1 Pork 2008-08-22
2 Beef 2008-08-15
3 Rice 2008-07-22
4 Coke 2008-04-05
5 Cereal 2008-04-03
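event['eDate'] = event.Date  # keep a copy of the event date, since the merge key column holds df's dates afterwards
df = pd.merge_asof(df.sort_values('Date'), event.sort_values('Date'), on="Date", direction='nearest')
df[(df.Date - df.eDate).abs() <= pd.Timedelta('30 days')]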
I would use a list comprehension with an IntervalIndex:
ms = pd.DateOffset(months=1)
e1 = event.Date - ms
e2 = event.Date + ms
iix = pd.IntervalIndex.from_arrays(e1, e2, closed='both')
df.loc[[any(d in i for i in iix) for d in df.Date]]
Out[93]:
Product Date
2 Cereal 2008-04-03
3 Coke 2008-04-05
4 Rice 2008-07-22
5 Beef 2008-08-15
6 Pork 2008-08-22
7 Chicken 2008-09-08
If only the month matters, irrespective of the day, this may be useful:
rng = []
for a, b in zip(event['Date'].dt.month - 1, event['Date'].dt.month + 1):
    rng = rng + list(range(a - 1, b + 1, 1))
df[df['Date'].dt.month.isin(set(rng))]
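To get both parts of the goal (the filter and the closest event) in one result, the merge_asof idea above can be sketched as follows. It assumes "within a month" means within 30 days, as one of the answers does; the Event column name follows the desired output in the question. By day count, 2008-08-15 is 17 days from 2008-09-01 and 26 days from 2008-07-20, so this sketch labels Beef with 2008-09-01.

```python
import pandas as pd

df = pd.DataFrame({
    'Product': ['Apple', 'Banana', 'Cereal', 'Coke', 'Rice', 'Beef', 'Pork', 'Chicken'],
    'Date': pd.to_datetime(['2008-04-02', '2008-04-01', '2008-04-03', '2008-04-05',
                            '2008-07-22', '2008-08-15', '2008-08-22', '2008-09-08']),
})
event = pd.DataFrame({'Date': pd.to_datetime(['2008-05-03', '2008-07-20', '2008-09-01'])})

# Give the event dates their own column name so they survive the merge.
events = event.rename(columns={'Date': 'Event'}).sort_values('Event')

# Attach the nearest event (earlier or later) to every observation.
# merge_asof requires both frames to be sorted on their keys.
merged = pd.merge_asof(df.sort_values('Date'), events,
                       left_on='Date', right_on='Event', direction='nearest')

# Keep only the observations within 30 days of that nearest event.
within = (merged['Date'] - merged['Event']).abs() <= pd.Timedelta(days=30)
result = merged[within].reset_index(drop=True)
print(result)
```

Filtering on the nearest event is enough here: if an observation is within 30 days of any event, it is also within 30 days of its nearest one.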

Dealing with abbreviation and misspelled words in DataFrame Pandas

I have a DataFrame that contains misspelled words and abbreviations like this.
input:
df = pd.DataFrame(['swtch', 'cola', 'FBI',
'smsng', 'BCA', 'MIB'], columns=['misspelled'])
output:
misspelled
0 swtch
1 cola
2 FBI
3 smsng
4 BCA
5 MIB
I need to correct the misspelled words and expand the abbreviations.
I have tried creating a dictionary such as:
input:
dicts = pd.DataFrame(['coca cola', 'Federal Bureau of Investigation',
'samsung', 'Bank Central Asia', 'switch', 'Men In Black'], columns=['words'])
output:
words
0 coca cola
1 Federal Bureau of Investigation
2 samsung
3 Bank Central Asia
4 switch
5 Men In Black
and applying this code:
import difflib
import numpy as np
x = [next(iter(x), np.nan) for x in map(lambda x: difflib.get_close_matches(x, dicts.words), df.misspelled)]
df['fix'] = x
print(df)
The result is that I have succeeded in correcting the misspelled words but not the abbreviations:
misspelled fix
0 swtch switch
1 cola coca cola
2 FBI NaN
3 smsng samsung
4 BCA NaN
5 MIB NaN
Please help.
How about following a two-pronged approach: first correct the misspellings, then expand the abbreviations:
import pandas as pd
from spellchecker import SpellChecker  # provided by the pyspellchecker package
df = pd.DataFrame(['swtch', 'cola', 'FBI', 'smsng', 'BCA', 'MIB'], columns=['misspelled'])
abbreviations = {
'FBI': 'Federal Bureau of Investigation',
'BCA': 'Bank Central Asia',
'MIB': 'Men In Black',
'cola': 'Coca Cola'
}
spell = SpellChecker()
df['fixed'] = df['misspelled'].apply(spell.correction).replace(abbreviations)
Result:
misspelled fixed
0 swtch switch
1 cola Coca Cola
2 FBI Federal Bureau of Investigation
3 smsng among
4 BCA Bank Central Asia
5 MIB Men In Black
I use pyspellchecker, but you can go with any spell-checking library. It corrected smsng to among, which is a caveat of automatic spelling correction. Different libraries may give different results.
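If you would rather stay with difflib from the question, the same two steps can be sketched without a spell-checking library: look the word up in an explicit abbreviation mapping first, and only then fall back to fuzzy matching against a list of known spellings. The names abbreviations, vocabulary, and fix are illustrative, and the word list is the one from the question split into the two roles.

```python
import difflib

import pandas as pd

df = pd.DataFrame(['swtch', 'cola', 'FBI', 'smsng', 'BCA', 'MIB'],
                  columns=['misspelled'])

# Abbreviations cannot be recovered by string similarity, so map them explicitly.
abbreviations = {
    'FBI': 'Federal Bureau of Investigation',
    'BCA': 'Bank Central Asia',
    'MIB': 'Men In Black',
}
# Known correct spellings, taken from the question's word list.
vocabulary = ['coca cola', 'samsung', 'switch']


def fix(word):
    """Expand a known abbreviation, else return the closest vocabulary word."""
    if word in abbreviations:
        return abbreviations[word]
    matches = difflib.get_close_matches(word, vocabulary, n=1)
    # Fall back to the original word when nothing is similar enough.
    return matches[0] if matches else word


df['fix'] = df['misspelled'].map(fix)
print(df)
```

The trade-off is that every abbreviation has to be listed by hand, but the fuzzy step can then only return words you have approved, which avoids surprises like smsng becoming among.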

Assign specific nominal values randomly to rows using pandas

I want to assign some selected nominal values randomly to rows. For example:
I have three nominal values ["apple", "orange", "banana"].
Before assigning these values randomly to rows:
**Name Fruit**
Jack
Julie
Juana
Jenny
Christina
Dickens
Robert
Cersei
After assigning these values randomly to rows:
**Name Fruit**
Jack Apple
Julie Orange
Juana Apple
Jenny Banana
Christina Orange
Dickens Orange
Robert Apple
Cersei Banana
How can I do this using pandas dataframe?
You can use np.random.choice with your values:
import numpy as np
vals = ["apple", "orange", "banana"]
df['Fruit'] = np.random.choice(vals, len(df))
>>> df
Name Fruit
0 Jack apple
1 Julie orange
2 Juana apple
3 Jenny orange
4 Christina apple
5 Dickens banana
6 Robert orange
7 Cersei orange
You can create a DataFrame in pandas and then assign random choices using numpy:
ex2 = pd.DataFrame({'Name':['Jack','Julie','Juana','Jenny','Christina','Dickens','Robert','Cersei']})
ex2['Fruits'] = np.random.choice(['Apple','Orange','Banana'],ex2.shape[0])
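If the assignment needs to be repeatable between runs, one option is NumPy's seeded Generator. This is a minimal sketch assuming the names and values from the question; the seed value is arbitrary.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': ['Jack', 'Julie', 'Juana', 'Jenny',
                            'Christina', 'Dickens', 'Robert', 'Cersei']})
vals = ['apple', 'orange', 'banana']

# A seeded Generator makes the "random" assignment repeatable between runs.
# The seed value 42 is arbitrary.
rng = np.random.default_rng(42)

# Draw one value per row, with replacement.
df['Fruit'] = rng.choice(vals, size=len(df))
print(df)
```

Each row is drawn independently, so the values will usually not be evenly distributed across rows.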
