Mapping values across dataframes to create a new one - Python

I have two dataframes. The first holds the nutritional information of certain ingredients, with ingredients as rows and nutritional categories as columns.
Item Brand and style Quantity Calories Total Fat ... Carbs Sugar Protein Fiber Sodium
0 Brown rice xxx xxxxxxxx xxxxx, long grain 150g 570 4.5 ... 1170 0 12 6 0
1 Whole wheat bread xxxxxxxx, whole grains 2 slices 220 4 ... 42 6 8 6 320
2 Whole wheat cereal xxx xxxxxxxx xxxxx, wheat squares 60g 220 1 ... 47 0 7 5 5
The second represents the type and quantity of ingredients of meals with the meals as rows and the ingredients as columns.
Meal Brown rice Whole wheat bread Whole wheat cereal ... Marinara sauce American cheese Olive oil Salt
0 Standard breakfast 0 0 1 ... 0 0 0 0
1 Standard lunch 0 2 0 ... 0 0 0 0
2 Standard dinner 0 0 0 ... 0 0 1 1
I am trying to create another dataframe such that the meals are rows and the nutritional categories are columns, representing the total nutritional value of each meal based on the quantities of its ingredients.
For example, if a standard lunch consists of 2 slices of bread (150 calories each slice), 1 serving of peanut butter (100 calories), and 1 serving of jelly (50 calories), then I would like the dataframe to be like:
Meal Calories Total fat ...
Standard lunch 450 xxx
Standard dinner xxx xxx
...
450 comes from (2*150 + 100 + 50).
The function template could be:
def create_meal_category_dataframe(ingredients_df, meals_df):
    ingredients = meals_df.columns[1:]
    meals = meals_df['Meal']
    # return meal_cat_df
I extracted lists of the meal and ingredient names, but I'm not sure if they're useful here. Thanks.
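Since the question leaves the body of the function open, here is a minimal sketch of one way to fill it in: the desired output is just a matrix product of the meal-ingredient counts and the per-ingredient nutrition values. The column names ('Item', 'Brand and style', 'Quantity', 'Meal') are taken from the tables shown; everything else is an assumption.

import pandas as pd

def create_meal_category_dataframe(ingredients_df, meals_df):
    # Per-ingredient nutrition, indexed by ingredient name; keep only
    # the numeric nutritional columns (drop the text/serving columns).
    nutrition = (ingredients_df.set_index('Item')
                               .drop(columns=['Brand and style', 'Quantity']))
    # Ingredient counts per meal, indexed by meal name.
    counts = meals_df.set_index('Meal')
    # Align the ingredient axis, then matrix-multiply:
    # (meals x ingredients) @ (ingredients x categories).
    # Any ingredient missing from ingredients_df becomes NaN here.
    meal_cat_df = counts.dot(nutrition.reindex(counts.columns))
    return meal_cat_df.reset_index()

For the Calories column of the lunch example, this reproduces 2*150 + 100 + 50 = 450, i.e. the dot product of the ingredient counts with the per-serving calories.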

Related

Python: Counting values for columns with multiple values per entry in dataframe

I have a dataframe of restaurants and one column has corresponding cuisines.
The problem is that there are restaurants with multiple cuisines in the same column [up to 8].
Let's say it's something like this:
RestaurantName City Restaurant ID Cuisines
Restaurant A Milan 31333 French, Spanish, Italian
Restaurant B Shanghai 63551 Pizza, Burgers
Restaurant C Dubai 7991 Burgers, Ice Cream
Here's copy-pasteable code as a sample:
import pandas as pd

rst = pd.DataFrame({'RestaurantName': ['Rest A', 'Rest B', 'Rest C'],
                    'City': ['Milan', 'Shanghai', 'Dubai'],
                    'RestaurantID': [31333, 63551, 7991],
                    'Cuisines': ['French, Spanish, Italian', 'Pizza, Burgers', 'Burgers, Ice Cream']})
I used string split to expand them into 8 different columns and added it to the dataframe.
csnsplit = rst.Cuisines.str.split(", ", expand=True)
rst["Cuisine1"] = csnsplit.loc[:, 0]
rst["Cuisine2"] = csnsplit.loc[:, 1]
rst["Cuisine3"] = csnsplit.loc[:, 2]
rst["Cuisine4"] = csnsplit.loc[:, 3]
rst["Cuisine5"] = csnsplit.loc[:, 4]
rst["Cuisine6"] = csnsplit.loc[:, 5]
rst["Cuisine7"] = csnsplit.loc[:, 6]
rst["Cuisine8"] = csnsplit.loc[:, 7]
Which leaves me with this:
https://i.stack.imgur.com/AUSDY.png
Now I have no idea how to count individual cuisines, since they're spread across up to 8 different columns, say, if I want to see the top cuisine by city.
I also tried getting dummy columns for all of them, Cuisine1 to Cuisine8. This leaves me with duplicates like Cuisine1_Bakery, Cuisine2_Bakery, and so on. I could hypothetically merge like columns and keep only the ones with a count of "1," but I have no idea how to do that.
dummies = pd.get_dummies(data=rst, columns=["Cuisine1", "Cuisine2", "Cuisine3", "Cuisine4",
                                            "Cuisine5", "Cuisine6", "Cuisine7", "Cuisine8"])
print(dummies.columns.tolist())
Which leaves me with all of these columns:
https://i.stack.imgur.com/84spI.png
A third thing I tried was to get unique values from all 8 columns, and I have a deduped list of each type of cuisine. I can probably add all these columns to the dataframe, but wouldn't know how to fill the rows with a count for each one based on the column name.
import numpy as np

AllCsn = np.concatenate((rst.Cuisine1.unique(),
                         rst.Cuisine2.unique(),
                         rst.Cuisine3.unique(),
                         rst.Cuisine4.unique(),
                         rst.Cuisine5.unique(),
                         rst.Cuisine6.unique(),
                         rst.Cuisine7.unique(),
                         rst.Cuisine8.unique()))
AllCsn = np.unique(AllCsn.astype(str))
AllCsn
Which leaves me with this:
https://i.stack.imgur.com/O9OpW.png
I do want to create a model later on where I maybe have a column for each cuisine, and use the "unique" code above to get all the columns, but then I would need to figure out how to do a count based on the column header.
I am new to this, so please bear with me and let me know if I need to provide any more info.
It sounds like you're looking for str.split without expanding, then explode:
rst['Cuisines'] = rst['Cuisines'].str.split(', ')
rst = rst.explode('Cuisines')
This creates a frame like:
RestaurantName City RestaurantID Cuisines
0 Rest A Milan 31333 French
0 Rest A Milan 31333 Spanish
0 Rest A Milan 31333 Italian
1 Rest B Shanghai 63551 Pizza
1 Rest B Shanghai 63551 Burgers
2 Rest C Dubai 7991 Burgers
2 Rest C Dubai 7991 Ice Cream
Then it sounds like you want either crosstab:
pd.crosstab(rst['City'], rst['Cuisines'])
Cuisines Burgers French Ice Cream Italian Pizza Spanish
City
Dubai 1 0 1 0 0 0
Milan 0 1 0 1 0 1
Shanghai 1 0 0 0 1 0
Or value_counts:
rst[['City', 'Cuisines']].value_counts().reset_index(name='counts')
City Cuisines counts
0 Dubai Burgers 1
1 Dubai Ice Cream 1
2 Milan French 1
3 Milan Italian 1
4 Milan Spanish 1
5 Shanghai Burgers 1
6 Shanghai Pizza 1
To get the max value_count per City, use groupby + head:
max_counts = (
    rst[['City', 'Cuisines']].value_counts()
       .groupby(level=0).head(1)
       .reset_index(name='counts')
)
max_counts:
City Cuisines counts
0 Dubai Burgers 1
1 Milan French 1
2 Shanghai Burgers 1
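A side note on the question's get_dummies attempt: applying str.get_dummies to the original comma-separated column (i.e. rst as built in the question, before any split or explode) avoids the duplicated Cuisine1_Bakery/Cuisine2_Bakery columns entirely, because each cuisine becomes exactly one indicator column. A minimal sketch:

# One indicator column per cuisine, no per-position duplicates.
cuisine_dummies = rst['Cuisines'].str.get_dummies(sep=', ')
# Top cuisine per city: sum the indicators per city and take the
# column with the largest count in each row (ties broken arbitrarily).
top_by_city = (cuisine_dummies.join(rst['City'])
                              .groupby('City').sum()
                              .idxmax(axis=1))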

Disaggregate pandas data frame using ratios from another data frame

I have a pandas data frame 'High' as
segment sales
Milk 10
Chocolate 30
and another data frame 'Low' as
segment sku sales
Milk m2341 2
Milk m235 3
Chocolate c132 2
Chocolate c241 5
Chocolate c891 3
I want to use the ratios from Low to disaggregate High. So my resulting data here would be
segment sku sales
Milk m2341 4
Milk m235 6
Chocolate c132 6
Chocolate c241 15
Chocolate c891 9
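For reference, a copy-pasteable construction of the two frames (the names df_high and df_low used in the answers below are assumed to correspond to 'High' and 'Low'):

import pandas as pd

df_high = pd.DataFrame({'segment': ['Milk', 'Chocolate'],
                        'sales': [10, 30]})
df_low = pd.DataFrame({'segment': ['Milk', 'Milk', 'Chocolate', 'Chocolate', 'Chocolate'],
                       'sku': ['m2341', 'm235', 'c132', 'c241', 'c891'],
                       'sales': [2, 3, 2, 5, 3]})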
First, I would find the scale by which we need to multiply each product's sales.
df_agg = df_low[["segment", "sales"]].groupby(by=["segment"]).sum().merge(df_high, on="segment")
# After the merge, sales_x is the summed Low sales and sales_y the High sales.
df_agg["scale"] = df_agg["sales_y"] / df_agg["sales_x"]
Then, apply the scale:
df_disagg_high = df_low.merge(df_agg[["segment", "scale"]])
df_disagg_high["adjusted_sale"] = df_disagg_high["sales"] * df_disagg_high["scale"]
If needed, you can exclude extra columns.
Try:
df_low["sales"] = df_low.sales.mul(
df_low.merge(
df_high.set_index("segment")["sales"].div(
df_low.groupby("segment")["sales"].sum()
),
on="segment",
)["sales_y"]
).astype(int)
print(df_low)
Prints:
segment sku sales
0 Milk m2341 4
1 Milk m235 6
2 Chocolate c132 6
3 Chocolate c241 15
4 Chocolate c891 9

Randomly chunk variables to groups of a certain number

I have a large pandas dataframe in which I am attempting to randomly chunk objects into groups of a certain size. For example, I am attempting to chunk the objects below into groups of 3. However, each group must contain objects of the same type. Here's a toy dataset:
type object index
ball soccer 1
ball soccer 2
ball basket 1
ball bouncy 1
ball tennis 1
ball tennis 2
chair office 1
chair office 2
chair office 3
chair lounge 1
chair dining 1
chair dining 2
... ... ...
Desired output:
type object index group
ball soccer 1 ball_1
ball soccer 2 ball_1
ball basket 1 ball_1
ball bouncy 1 ball_1
ball tennis 1 ball_2
ball tennis 2 ball_2
chair office 1 chair_1
chair office 2 chair_1
chair office 3 chair_1
chair lounge 1 chair_1
chair dining 1 chair_1
chair dining 2 chair_1
... ... ... ...
So here, the group ball_1 contains 3 unique objects of the same type: soccer, basket, and bouncy. The remaining object goes into group ball_2, which has only 1 object. Since the dataframe is so large, I'm hoping for a long list of groups that contain 3 objects and one group per type that contains the remainder objects (anything fewer than 3).
Again, while my example only contains a few objects, I'm hoping for the objects to be randomly sorted into groups of 3. (My real dataset will contain many more balls and chairs.)
This seemed helpful, but I haven't figured out how to apply it yet: How do you split a list into evenly sized chunks?
If you need to split every N unique values per type into groups, you can use factorize with GroupBy.transform, integer-divide by N and add 1, and finally prepend the type column with Series.str.cat:
N = 3
g = df.groupby('type')['object'].transform(lambda x: pd.factorize(x)[0]) // N + 1
df['group'] = df['type'].str.cat(g.astype(str), '_')
print (df)
type object index group
0 ball soccer 1 ball_1
1 ball soccer 2 ball_1
2 ball basket 1 ball_1
3 ball bouncy 1 ball_1
4 ball tennis 1 ball_2
5 ball tennis 2 ball_2
6 chair office 1 chair_1
7 chair office 2 chair_1
8 chair office 3 chair_1
9 chair lounge 1 chair_1
10 chair dining 1 chair_1
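To see why tennis ends up in ball_2, note that factorize numbers the unique objects within each type in order of appearance; integer-dividing those codes by N then buckets every N distinct objects together. A small standalone check:

import pandas as pd

codes, uniques = pd.factorize(['soccer', 'soccer', 'basket', 'bouncy', 'tennis', 'tennis'])
print(codes)           # [0 0 1 2 3 3] -> each row's object code, in order of appearance
print(codes // 3 + 1)  # [1 1 1 1 2 2] -> tennis (code 3) falls into group 2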
If you also need to randomize the values, add DataFrame.sample:
N = 3
df = df.sample(frac=1)
g = df.groupby('type')['object'].transform(lambda x: pd.factorize(x)[0]) // N + 1
df['group'] = df['type'].str.cat(g.astype(str), '_')
print (df)
type object index group
10 chair dining 1 chair_1
8 chair office 3 chair_1
2 ball basket 1 ball_1
1 ball soccer 2 ball_1
7 chair office 2 chair_1
0 ball soccer 1 ball_1
9 chair lounge 1 chair_1
4 ball tennis 1 ball_1
6 chair office 1 chair_1
3 ball bouncy 1 ball_2
5 ball tennis 2 ball_1
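One follow-up worth noting: df.sample(frac=1) produces a different shuffle, and therefore different group assignments, on every run. If you need the assignment to be reproducible (an assumption about the use case), pass a seed:

# Same seed, same shuffle, same groups on every run.
df = df.sample(frac=1, random_state=42)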

Merge two pandas dataframes to create a new dataframe with a specific operation

I have two dataframes as shown below.
Company Name BOD Position Ethnicity DOB Age Gender Degree ( Specialazation) Remark
0 Big Lots Inc. David J. Campisi Director, President and Chief Executive Offic... American 1956 61 Male Graduate NaN
1 Big Lots Inc. Philip E. Mallott Chairman of the Board American 1958 59 Male MBA, Finace NaN
2 Big Lots Inc. James R. Chambers Independent Director American 1958 59 Male MBA NaN
3 Momentive Performance Materials Inc Mahesh Balakrishnan director Asian 1983 34 Male BA Economics NaN
Company Name Net Sale Gross Profit Remark
0 Big Lots Inc. 5.2B 2.1B NaN
1 Momentive Performance Materials Inc 544M 146m NaN
2 Markel Corporation 5.61B 2.06B NaN
3 Noble Energy, Inc. 3.49B 2.41B NaN
4 Leidos Holding, Inc. 7.04B 852M NaN
I want to create a new dataframe from these two, so that the 2nd dataframe has new columns with the count of each ethnicity per company, such as American - 2, Mexican - 5, and so on, so that later on I can calculate a diversity score.
The variables in the output dataframe would be like:
Company Name Net Sale Gross Profit Remark American Mexican German .....
Big Lots Inc. 5.2B 2.1B NaN 2 0 5 ....
First get the counts per group with groupby + size and unstack, then join to the second DataFrame:
df1 = pd.DataFrame({'Company Name': list('aabcac'),
                    'Ethnicity': ['American'] * 3 + ['Mexican'] * 3})
df1 = df1.groupby(['Company Name', 'Ethnicity']).size().unstack(fill_value=0)
# slower alternative:
# df1 = pd.crosstab(df1['Company Name'], df1['Ethnicity'])
print (df1)
Ethnicity American Mexican
Company Name
a 2 1
b 1 0
c 0 2
df2 = pd.DataFrame({'Company Name':list('abc')})
print (df2)
Company Name
0 a
1 b
2 c
df3 = df2.join(df1, on=['Company Name'])
print (df3)
Company Name American Mexican
0 a 2 1
1 b 1 0
2 c 0 2
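In the question's data the second frame lists companies (Markel Corporation, Noble Energy, ...) that have no board rows in the first, and join leaves NaN counts for those. If you want zeros instead, fill them in:

df3 = df2.join(df1, on=['Company Name']).fillna(0)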
EDIT: You need to replace the units with zeros and convert to floats:
print (df)
Name sale
0 A 100M
1 B 200M
2 C 5M
3 D 40M
4 E 10B
5 F 2B
d = {'M': '0' * 6, 'B': '0' * 9}
# Pad 'M'/'B' with the right number of zeros, then convert to float.
# (Sorting before assigning back would have no effect, since assignment aligns on index.)
df['a'] = df['sale'].replace(d, regex=True).astype(float)
print (df)
Name sale a
0 A 100M 1.000000e+08
1 B 200M 2.000000e+08
2 C 5M 5.000000e+06
3 D 40M 4.000000e+07
4 E 10B 1.000000e+10
5 F 2B 2.000000e+09
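Note that the zero-padding trick assumes the number part is an integer: a value like 5.2B from the question's data becomes '5.2000000000', which parses as 5.2 rather than 5.2e9. A minimal sketch that also handles decimals and the lowercase 'm' in the question's data, assuming every value ends in an M/B suffix:

mult = {'M': 1e6, 'B': 1e9}
s = df['sale'].str.upper()
# Split each value into its numeric part and unit suffix, then scale.
df['a'] = s.str[:-1].astype(float) * s.str[-1].map(mult)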

Generate columns of top ranked values in Pandas

I have a dataframe topic_data that contains the output of an LDA topic model:
topic_data.head(15)
topic word score
0 0 Automobile 0.063986
1 0 Vehicle 0.017457
2 0 Horsepower 0.015675
3 0 Engine 0.014857
4 0 Bicycle 0.013919
5 1 Sport 0.032938
6 1 Association_football 0.025324
7 1 Basketball 0.020949
8 1 Baseball 0.016935
9 1 National_Football_League 0.016597
10 2 Japan 0.051454
11 2 Beer 0.032839
12 2 Alcohol 0.027909
13 2 Drink 0.019494
14 2 Vodka 0.017908
This shows the top 5 terms for each topic, and the score (weight) for each. What I'm trying to do is reformat so that the index is the rank of the term, the columns are the topic IDs, and the values are formatted strings generated from the word and score columns (something along the lines of "%s (%.02f)" % (word,score)). That means the new dataframe should look something like this:
Topic 0 1 ...
Rank
0 Automobile (0.06) Sport (0.03) ...
1 Vehicle (0.017) Association_football (0.03) ...
... ... ... ...
What's the right way of going about this? I assume it involves a combination of index-setting, unstacking, and ranking, but I'm not sure of the right approach.
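For reference, a copy-pasteable construction of the sample data above (the answer below refers to the frame as df, so it is aliased here):

import pandas as pd

topic_data = pd.DataFrame({
    'topic': [0] * 5 + [1] * 5 + [2] * 5,
    'word': ['Automobile', 'Vehicle', 'Horsepower', 'Engine', 'Bicycle',
             'Sport', 'Association_football', 'Basketball', 'Baseball',
             'National_Football_League',
             'Japan', 'Beer', 'Alcohol', 'Drink', 'Vodka'],
    'score': [0.063986, 0.017457, 0.015675, 0.014857, 0.013919,
              0.032938, 0.025324, 0.020949, 0.016935, 0.016597,
              0.051454, 0.032839, 0.027909, 0.019494, 0.017908],
})
df = topic_data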
It would be something like this; note that Rank has to be generated first:
# Rank within each topic by descending score (0 = highest).
df['Rank'] = df.groupby('topic')['score'].rank(ascending=False, method='first').astype(int) - 1
# Format each word together with its rounded score.
df['New_str'] = df['word'] + df['score'].apply(' ({0:.2f})'.format)
df2 = df.sort_values(['Rank', 'score'])[['New_str', 'topic', 'Rank']]
print(df2.pivot(index='Rank', values='New_str', columns='topic'))
topic 0 1 2
Rank
0 Automobile (0.06) Sport (0.03) Japan (0.05)
1 Vehicle (0.02) Association_football (0.03) Beer (0.03)
2 Horsepower (0.02) Basketball (0.02) Alcohol (0.03)
3 Engine (0.01) Baseball (0.02) Drink (0.02)
4 Bicycle (0.01) National_Football_League (0.02) Vodka (0.02)
