Building a word thesaurus in Python

I have a list of words that were entered by my users. After some cleanup (to correct spelling mistakes), I have the following list, where each row is a string and the number of times it was entered:
Pepsi 500
Coke 358
Dr. pepper 254
Sprite 204
Coca cola 159
7 up 140
Mountain dew 137
Diet coke 58
Mtn. dew 50
Now I would like a script that goes over this list and groups similar words.
For example, merging Coke, Coca cola and Diet coke into one group (because they are synonyms of Coca cola).
I saw that NLTK WordNet has some similarity functions. Can I use them, or is there a "better" way of approaching this problem?
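One possible starting point, offered as a rough sketch rather than an answer from this thread: brand names are unlikely to be covered well by WordNet, so plain string similarity from the standard library's difflib can catch spelling variants (Mtn. dew / Mountain dew), while a small hand-written alias table covers true synonyms that do not look alike (Coke / Coca cola). The alias table, the function name group_similar and the 0.55 threshold are all assumptions made for illustration.

```python
from difflib import SequenceMatcher

# Counts copied from the list in the question.
counts = {
    "Pepsi": 500, "Coke": 358, "Dr. pepper": 254, "Sprite": 204,
    "Coca cola": 159, "7 up": 140, "Mountain dew": 137,
    "Diet coke": 58, "Mtn. dew": 50,
}

# Hypothetical hand-written alias table for synonyms that do not look alike.
aliases = {"coca cola": "coke"}


def group_similar(counts, aliases, threshold=0.55):
    """Greedily attach each string to the first more frequent string it resembles."""
    groups = {}  # representative string -> list of member strings
    # Visit the most frequent strings first so they become the representatives.
    for name in sorted(counts, key=counts.get, reverse=True):
        key = aliases.get(name.lower(), name.lower())
        for rep in groups:
            if SequenceMatcher(None, key, rep.lower()).ratio() >= threshold:
                groups[rep].append(name)
                break
        else:
            # No existing representative is similar enough: start a new group.
            groups[name] = [name]
    return groups


groups = group_similar(counts, aliases)
for rep, members in groups.items():
    print(rep, members)
```

The threshold needs tuning on real data: set too low it merges unrelated brands, set too high it misses abbreviations.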

Related

I've been trying to figure this out for way too long.

I can only use np and bpd.
Question 1.3
1 point
The 'Segment_Category' describes the food and service of each chain. What are the most popular segment categories in chains?
Create an array called ordered_segment_categories containing all the segment categories, ordered from the most popular segment category to the least popular segment category in chains.
This is the dataframe.
Rank Restaurant Sales YOY_Sales Segment_Category
0 1 McDonald's 40517 0.30% Quick Service & Burger
1 2 Starbucks 18485 -13.50% Quick Service & Coffee Cafe
2 3 Chick-fil-A 13745 13.00% Quick Service & Chicken
3 4 Taco Bell 11294 0.00% Quick Service & Mexican
4 5 Wendy's 10231 4.80% Quick Service & Burger
... ... ... ... ... ...
245 246 American Deli 98 2.00% Quick Service & Sandwich
246 247 Bonchon 98 -19.50% Casual Dining & Asian
247 248 Chopt 98 -12.40% Fast Casual & All Other
248 249 Chicken Express 96 -6.50% Quick Service & Chicken
249 250 Sizzler 96 -63.00% Quick Service & All Other
I tried everything and just can't seem to figure it out.
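A minimal sketch of one way to approach this: count the rows per segment category, sort the counts in descending order, and take the index as an array. It uses plain pandas as a stand-in, on the assumption that bpd offers the same groupby, count, sort_values and index operations. The small frame below holds only the ten rows visible in the question, not the full 250.

```python
import numpy as np
import pandas as pd  # stand-in for bpd; assumed to expose the same methods used here

# Only the rows visible in the question, for illustration.
chains = pd.DataFrame({
    "Restaurant": ["McDonald's", "Starbucks", "Chick-fil-A", "Taco Bell", "Wendy's",
                   "American Deli", "Bonchon", "Chopt", "Chicken Express", "Sizzler"],
    "Segment_Category": [
        "Quick Service & Burger", "Quick Service & Coffee Cafe",
        "Quick Service & Chicken", "Quick Service & Mexican",
        "Quick Service & Burger", "Quick Service & Sandwich",
        "Casual Dining & Asian", "Fast Casual & All Other",
        "Quick Service & Chicken", "Quick Service & All Other",
    ],
})

# After count(), every remaining column holds the number of chains per category.
counts = chains.groupby("Segment_Category").count()

# Sort by any of those count columns, most popular category first.
ordered = counts.sort_values(by="Restaurant", ascending=False)

# The index of the sorted frame is the list of categories in the wanted order.
ordered_segment_categories = np.array(ordered.index)
print(ordered_segment_categories)
```

Categories with equal counts can come out in either order.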

Pandas Create a new Column which takes the Most Frequent item Description given Item Codes

I have a dataframe that looks something like this:
Group  UPC      Description
246    1234568  Chips BBQ
158    7532168  Cereal Honey
246    9876532  Chips Ketchup
665    8523687  Strawberry Jam
246    1234568  Chips BBQ
158    5553215  Cereal Chocolate
I want to replace the descriptions of the items with the most frequent description based on the group # or the first instance if there is a tie.
So in the example above, Chips Ketchup (1 instance) is replaced with Chips BBQ (2 instances), and Cereal Chocolate is replaced with Cereal Honey (first instance).
Desired output would be:
Group  UPC      Description
246    1234568  Chips BBQ
158    7532168  Cereal Honey
246    9876532  Chips BBQ
665    8523687  Strawberry Jam
246    1234568  Chips BBQ
158    5553215  Cereal Honey
If this is too complicated I can settle for replacing with simply the first instance without taking frequency into consideration at all.
Thanks in advance
You can use
df['Description'] = df.groupby('Group')['Description'].transform(lambda s: s.value_counts().index[0])
Series.value_counts (unlike Series.mode, which I also tried) seems to order elements that occur the same number of times by their first occurrence. This behavior is not documented, so I'm not sure you can rely on it.
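If relying on undocumented tie ordering is a concern, the tie-break can be made explicit. The sketch below is one way to do that, not the answerer's code: it finds the descriptions tied for the highest count in each group and then picks whichever of them appears first in the original row order. The helper name most_frequent_first_seen is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Group": [246, 158, 246, 665, 246, 158],
    "UPC": [1234568, 7532168, 9876532, 8523687, 1234568, 5553215],
    "Description": ["Chips BBQ", "Cereal Honey", "Chips Ketchup",
                    "Strawberry Jam", "Chips BBQ", "Cereal Chocolate"],
})


def most_frequent_first_seen(s):
    """Return the most frequent value of s; ties go to the value seen first."""
    counts = s.value_counts()
    tied_for_top = set(counts[counts == counts.max()].index)
    # Walk the values in their original order and stop at the first top value.
    return next(value for value in s if value in tied_for_top)


# transform broadcasts the single value returned per group to every row of that group.
df["Description"] = df.groupby("Group")["Description"].transform(most_frequent_first_seen)
print(df)
```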

Modifying dataframe rows based on list of strings

Background
I have a dataset where I have the following:
product_title price
Women's Pant 20.00
Men's Shirt 30.00
Women's Dress 40.00
Blue 4" Shorts 30.00
Blue Shorts 35.00
Green 2" Shorts 30.00
I created a new column called gender which contains the values Women, Men, or Unisex based on the specified string in product_title.
The output looks like this:
product_title price gender
Women's Pant 20.00 women
Men's Shirt 30.00 men
Women's Dress 40.00 women
Blue 4" Shorts 30.00 women
Blue Shorts 35.00 unisex
Green 2" Shorts 30.00 women
Approach
I approached creating a new column by using if/else statements:
df['gender'] = ['women' if 'women' in word or 'Blue 4"' in word or 'Green 2"' in word
else "men" if "men" in word
else "unisex"
for word in df.product_title.str.lower()]
Although this approach works, it becomes very long when I have a lot of conditions for labeling women vs men vs unisex. Is there cleaner way to do this? Is there a way I can pass a list of strings instead of having a long chain of or conditions?
I would really appreciate help as I am new to python and pandas library.
IIUC,
import numpy as np
s = df['product_title'].str.lower()
# check 'women' first: 'men' is a substring of 'women', and np.select uses the first condition that matches
df['gender'] = np.select([s.str.contains('women|blue 4"|green 2"'),
                          s.str.contains('men')],
                         ['women', 'men'],
                         default='unisex')
Here is another idea with str.extract and series.map
d = {'women':['women','blue 4"','green 2"'],'men':['men']}
d1 = {val:k for k,v in d.items() for val in v}
pat = '|'.join(d1.keys())
import re
df['gender'] = (df['product_title'].str.extract('('+pat+')',flags=re.I,expand=False)
.str.lower().map(d1).fillna('unisex'))
print(df)
product_title price gender
0 Women's Pant 20.0 women
1 Men's Shirt 30.0 men
2 Women's Dress 40.0 women
3 Blue 4" Shorts 30.0 women
4 Blue Shorts 35.0 unisex
5 Green 2" Shorts 30.0 women
You can also define your own function and run it with apply.
Create the function, which you can change as you need:
def sex(title):
    '''
    Look for specific keywords and return the matching label.
    '''
    title = title.lower()
    # keywords are lowercase because the title is lowercased before matching
    for word in ['women', 'blue 4"', 'green 2"']:
        if word in title:
            return 'women'
    if 'men' in title:
        return 'men'
    return 'unisex'
Then apply it to the column you need to check:
df['gender'] = df['product_title'].apply(sex)
Cheers!
EDIT 3:
After looking around and testing the numpy approach from @ansev, following @anky's comment, I found that this may be faster up to a certain point: it was still faster when tested with 5000 rows, but the numpy approach started to catch up. So it really depends on how big your dataset is.
I will remove any comments on speed, given that I initially tested only on this small frame. This is still a learning process, as you can see from my level.
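One detail matters in every approach here: 'men' is a substring of 'women', so the women check has to come first wherever the first match wins. The sketch below combines the keyword-dictionary idea with np.select to show this. The keywords dictionary is an assumption for illustration, and its insertion order sets the priority.

```python
import re

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "product_title": ["Women's Pant", "Men's Shirt", "Women's Dress",
                      'Blue 4" Shorts', "Blue Shorts", 'Green 2" Shorts'],
    "price": [20.0, 30.0, 40.0, 30.0, 35.0, 30.0],
})

# Label -> lowercase keywords. 'women' comes first because np.select
# returns the choice for the first condition that is True.
keywords = {
    "women": ["women", 'blue 4"', 'green 2"'],
    "men": ["men"],
}

titles = df["product_title"].str.lower()

# One boolean condition per label: does the title contain any of its keywords?
conditions = [
    titles.str.contains("|".join(re.escape(word) for word in words))
    for words in keywords.values()
]

df["gender"] = np.select(conditions, list(keywords), default="unisex")
print(df)
```

Adding a rule then means adding a keyword to the dictionary rather than extending a chain of or conditions.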

How to combine common rows in DataFrame

I'm running some analysis on bank statements (csv's). Some items like McDonalds each have their own row (due to having different addresses).
I'm trying to combine these rows by a common phrase. So for this example the obvious phrase, or string, would be "McDonalds". I think it'll be an if statement.
Also, the column has a dtype of "object". Will I have to convert it to string format?
Here is an example of the output from printing totali = df.Item.value_counts() in my code.
Ideally I'd want that line to output McDonalds as just a single row.
In the csv they are 2 separate rows.
foo 14
Restaurant Boulder CO 8
McDonalds Boulder CO 5
McDonalds Denver CO 5
Here's what the column data consists of
'Sukiya Greenwood Vil CO' 'Sei 34179 Denver CO' 'Chambers Place Liquors 303-3731100 CO' "Mcdonald's F26593 Fort Collins CO" 'Suh Sushi Korean Bbq Fort Collins CO' 'Conoco - Sei 26927 Fort Collins CO'
OK. I think I ginned up something that can be helpful. Realize that the task of inferring categories or names from text strings can be huge, depending on how detailed you want to get. You can dive into regex or other learning models. People make careers of it! Obviously, your bank is doing some of this as they categorize things when you get a year-end summary.
Anyhow, here is a simple way to generate some categories and use them as a basis for the grouping that you want to do.
import pandas as pd
item=['McDonalds Denver', 'Sonoco', 'ATM Fee', 'Sonoco, Ft. Collins', 'McDonalds, Boulder', 'Arco Boulder']
txn = [12.44, 4.00, 3.00, 14.99, 19.10, 52.99]
df = pd.DataFrame([item, txn]).T
df.columns = ['item_orig', 'charge']
print(df)
# let's add an extra column to catch the conversions...
df['item'] = pd.Series(dtype=str)
# we'll use the "contains" function in pandas as a simple converter... quick demo
temp = df.loc[df['item_orig'].str.contains('McDonalds')]
print('\nitems that containt the string "McDonalds"')
print(temp)
# let's build a simple conversion table in a dictionary
conversions = { 'McDonalds': 'McDonalds - any',
'Sonoco': 'gas',
'Arco': 'gas'}
# let's loop over the orig items and put conversions into the new column
# (there is probably a faster way to do this, but for data with < 100K rows, who cares.)
for key in conversions:
    # assign through a single .loc call so the write reaches df itself, not a temporary copy
    df.loc[df['item_orig'].str.contains(key), 'item'] = conversions[key]
# see how we did...
print('converted...')
print(df)
# now move over anything that was NOT converted
# in this example, this is just the ATM Fee item...
df.loc[df['item'].isnull(), 'item'] = df['item_orig']
# now we have decent labels to support grouping!
print('\n\n *** sum of charges by group ***')
print(df.groupby('item')['charge'].sum())
Yields:
item_orig charge
0 McDonalds Denver 12.44
1 Sonoco 4
2 ATM Fee 3
3 Sonoco, Ft. Collins 14.99
4 McDonalds, Boulder 19.1
5 Arco Boulder 52.99
items that containt the string "McDonalds"
item_orig charge item
0 McDonalds Denver 12.44 NaN
4 McDonalds, Boulder 19.1 NaN
converted...
item_orig charge item
0 McDonalds Denver 12.44 McDonalds - any
1 Sonoco 4 gas
2 ATM Fee 3 NaN
3 Sonoco, Ft. Collins 14.99 gas
4 McDonalds, Boulder 19.1 McDonalds - any
5 Arco Boulder 52.99 gas
*** sum of charges by group ***
item
ATM Fee 3.00
McDonalds - any 31.54
gas 71.98
Name: charge, dtype: float64
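The loop over the conversion table can also be written without explicit iteration. This is a sketch of one alternative, not part of the original answer: build a single regular expression from the dictionary keys, extract the first key found in each item, map it through the dictionary, and fall back to the original text where nothing matched.

```python
import re

import pandas as pd

df = pd.DataFrame({
    "item_orig": ["McDonalds Denver", "Sonoco", "ATM Fee",
                  "Sonoco, Ft. Collins", "McDonalds, Boulder", "Arco Boulder"],
    "charge": [12.44, 4.00, 3.00, 14.99, 19.10, 52.99],
})

conversions = {"McDonalds": "McDonalds - any", "Sonoco": "gas", "Arco": "gas"}

# One capture group matching any of the dictionary keys.
pattern = "(" + "|".join(re.escape(key) for key in conversions) + ")"

# The first key found in each item, or NaN when no key occurs.
matched = df["item_orig"].str.extract(pattern, expand=False)

# Translate keys to labels and keep the original text for unmatched rows.
df["item"] = matched.map(conversions).fillna(df["item_orig"])

totals = df.groupby("item")["charge"].sum()
print(totals)
```

Matching is case-sensitive, so a variant such as "Mcdonald's" from the question's data would need its own key or a case-insensitive pattern.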

Mapping values across dataframes to create a new one

I have two dataframes. The first represents the nutritional information of certain ingredients with ingredients as rows and the columns as the nutritional categories.
Item Brand and style Quantity Calories Total Fat ... Carbs Sugar Protein Fiber Sodium
0 Brown rice xxx xxxxxxxx xxxxx, long grain 150g 570 4.5 ... 1170 0 12 6 0
1 Whole wheat bread xxxxxxxx, whole grains 2 slices 220 4 ... 42 6 8 6 320
2 Whole wheat cereal xxx xxxxxxxx xxxxx, wheat squares 60g 220 1 ... 47 0 7 5 5
The second represents the type and quantity of ingredients of meals with the meals as rows and the ingredients as columns.
Meal Brown rice Whole wheat bread Whole wheat cereal ... Marinara sauce American cheese Olive oil Salt
0 Standard breakfast 0 0 1 ... 0 0 0 0
1 Standard lunch 0 2 0 ... 0 0 0 0
2 Standard dinner 0 0 0 ... 0 0 1 1
I am trying to create another dataframe such that the meals are rows and the nutritional categories are at the top, representing the entire nutritional value of the meal based on the number of ingredients.
For example, if a standard lunch consists of 2 slices of bread (150 calories each slice), 1 serving of peanut butter (100 calories), and 1 serving of jelly (50 calories), then I would like the dataframe to be like:
Meal Calories Total fat ...
Standard lunch 450 xxx
Standard dinner xxx xxx
...
450 comes from (2*150 + 100 + 50).
The function template could be:
def create_meal_category_dataframe(ingredients_df, meals_df):
    ingredients = meals_df.columns[1:]
    meals = meals_df['Meal']
    # return meal_cat_df
I extracted lists of the meal and ingredient names, but I'm not sure if they're useful here. Thanks.
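The calculation described here is a matrix product: a meals-by-ingredients table of servings multiplied by an ingredients-by-nutrients table. The sketch below shows that idea under several assumptions: the nutrient columns are numeric, text columns such as the brand and quantity are left out, and the counts in the meals table are in the same serving units as the nutrition rows. The extra nutrient_cols parameter and the Protein values are made up for illustration. The calorie figures come from the lunch example in the question.

```python
import pandas as pd


def create_meal_category_dataframe(ingredients_df, meals_df, nutrient_cols):
    # Rows: ingredients; columns: numeric nutrient values per serving.
    nutrients = ingredients_df.set_index("Item")[nutrient_cols]
    # Rows: meals; columns: number of servings of each ingredient.
    servings = meals_df.set_index("Meal")
    # Put the ingredient rows in the same order as the meal columns.
    nutrients = nutrients.reindex(servings.columns)
    # (meals x ingredients) . (ingredients x nutrients) = (meals x nutrients)
    return servings.dot(nutrients)


ingredients_df = pd.DataFrame({
    "Item": ["Bread", "Peanut butter", "Jelly"],
    "Calories": [150, 100, 50],   # per serving, from the example in the question
    "Protein": [4, 4, 0],         # made-up placeholder values
})
meals_df = pd.DataFrame({
    "Meal": ["Standard lunch"],
    "Bread": [2],
    "Peanut butter": [1],
    "Jelly": [1],
})

meal_cat_df = create_meal_category_dataframe(
    ingredients_df, meals_df, ["Calories", "Protein"]
)
print(meal_cat_df)
```

An ingredient that appears in the meals table but not in the nutrition table becomes a row of NaN after reindex, which makes every total NaN instead of silently undercounting.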
