Disaggregate pandas data frame using ratios from another data frame - python

I have a pandas data frame 'High' as
segment sales
Milk 10
Chocolate 30
and another data frame 'Low' as
segment sku sales
Milk m2341 2
Milk m235 3
Chocolate c132 2
Chocolate c241 5
Chocolate c891 3
I want to use the ratios from Low to disaggregate High. So my resulting data here would be
segment sku sales
Milk m2341 4
Milk m235 6
Chocolate c132 6
Chocolate c241 15
Chocolate c891 9

First, I would find the scale by which we need to multiply each product's sales.
df_agg = df_low[["segment", "sales"]].groupby(by=["segment"]).sum().merge(df_high, on="segment")
df_agg["scale"] = df_agg["sales_y"] / df_agg["sales_x"]
Then, apply the scale
df_disagg_high = df_low.merge(df_agg[["segment", "scale"]])
df_disagg_high["adjusted_sale"] = df_disagg_high["sales"] * df_disagg_high["scale"]
If needed, you can exclude extra columns.
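Putting the two steps together, a runnable sketch that reconstructs df_high and df_low from the example above (I use as_index=False and explicit suffixes instead of the default sales_x/sales_y, so those column names are assumptions of this sketch):

```python
import pandas as pd

df_high = pd.DataFrame({"segment": ["Milk", "Chocolate"], "sales": [10, 30]})
df_low = pd.DataFrame({
    "segment": ["Milk", "Milk", "Chocolate", "Chocolate", "Chocolate"],
    "sku": ["m2341", "m235", "c132", "c241", "c891"],
    "sales": [2, 3, 2, 5, 3],
})

# per-segment totals of the low-level data, joined to the high-level totals
df_agg = (
    df_low.groupby("segment", as_index=False)["sales"].sum()
    .merge(df_high, on="segment", suffixes=("_low", "_high"))
)
df_agg["scale"] = df_agg["sales_high"] / df_agg["sales_low"]

# apply each segment's scale to its SKUs
df_disagg_high = df_low.merge(df_agg[["segment", "scale"]], on="segment")
df_disagg_high["adjusted_sale"] = df_disagg_high["sales"] * df_disagg_high["scale"]
print(df_disagg_high[["segment", "sku", "adjusted_sale"]])
```

Here Milk scales by 10/5 = 2 and Chocolate by 30/10 = 3, reproducing the 4, 6, 6, 15, 9 in the question.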

Try:
df_low["sales"] = df_low.sales.mul(
    df_low.merge(
        df_high.set_index("segment")["sales"].div(
            df_low.groupby("segment")["sales"].sum()
        ),
        on="segment",
    )["sales_y"]
).astype(int)
print(df_low)
Prints:
segment sku sales
0 Milk m2341 4
1 Milk m235 6
2 Chocolate c132 6
3 Chocolate c241 15
4 Chocolate c891 9


Pythonic way to regroup a pandas dataframe using max of a column

I have the following data frame that has been obtained by applying df.groupby(['category', 'unit_quantity']).count()
                               Count
category         unit_quantity
banana           1EA           5
eggs             100G          22
                 100ML         1
full cream milk  100G          5
                 100ML         1
                 1L            38
Let's call this latter dataframe grouped. I want to find a way to regroup it using the columns unit_quantity and Count to get:
category         unit_quantity  Count  Most Frequent unit_quantity
banana           1EA            5      1EA
eggs             100G           22     100G
                 100ML          1      100G
full cream milk  100G           5      1L
                 100ML          1      1L
                 1L             38     1L
Now, I tried to apply grouped.groupby(level=1).max(), which gives me
               Count
unit_quantity
100G           22
100ML          1
1EA            5
1L             38
Now, because the indices of the latter and grouped do not coincide, I cannot join it using .merge. Does someone know how to solve this issue?
Thanks in advance
Starting from your DataFrame:
>>> import pandas as pd
>>> df = pd.DataFrame({'category': ['banana', 'eggs', 'eggs', 'full cream milk', 'full cream milk', 'full cream milk'],
... 'unit_quantity': ['1EA', '100G', '100ML', '100G', '100ML', '1L'],
... 'Count': [5, 22, 1, 5, 1, 38],},
... index = [0, 1, 2, 3, 4, 5])
>>> df
category unit_quantity Count
0 banana 1EA 5
1 eggs 100G 22
2 eggs 100ML 1
3 full cream milk 100G 5
4 full cream milk 100ML 1
5 full cream milk 1L 38
You can use the transform method applied on the max of the Count column in order to keep your category and unit_quantity values:
>>> idx = df.groupby(['unit_quantity'])['Count'].transform(max) == df['Count']
>>> df[idx]
category unit_quantity Count
0 banana 1EA 5
1 eggs 100G 22
2 eggs 100ML 1
4 full cream milk 100ML 1
5 full cream milk 1L 38

adding values in new column based on string contains in another column

I have a DataFrame:
date descriptions Code
1. 1/1/2020 this is aPple 6546
2. 21/8/2019 this is fan for him 4478
3. 15/3/2020 this is ball of hockey 5577
4. 12/2/2018 this is Green apple 7899
5. 13/3/2002 this is iron fan 7788
6. 14/5/2020 this ball is soft 9991
I want to create a new column 'category': if the expression apple, fan, or ball (capital or small letters) appears in the descriptions column, then the value A001, F009, or B099 respectively should be entered in the category column. The required DataFrame would be:
date descriptions Code category
1. 1/1/2020 this is aPple 6546 A001
2. 21/8/2019 this is fan for him 4478 F009
3. 15/3/2020 this is ball of hockey 5577 B099
4. 12/2/2018 this is Green apple 7899 A001
5. 13/3/2002 this is iron fan 7788 F009
6. 14/5/2020 this ball is soft 9991 B099
Use str.extract to get the substring from the string-based column:
d = {'apple': 'A001', 'ball': 'B099', 'fan': 'F009'}
df['category'] = (
    df.descriptions
    .str.lower()
    .str.extract('(' + '|'.join(d.keys()) + ')')
    .squeeze().map(d)
)
You can use numpy.select, which allows for multiple conditional selection:
import numpy as np

content = ["apple", "fan", "ball"]
condlist = [df.descriptions.str.lower().str.contains(letter) for letter in content]
choicelist = ["A001", "F009", "B099"]
df["category"] = np.select(condlist, choicelist)
df
date descriptions Code category
0 1/1/2020 this is aPple 6546 A001
1 21/8/2019 this is fan for him 4478 F009
2 15/3/2020 this is ball of hockey 5577 B099
3 12/2/2018 this is Green apple 7899 A001
4 13/3/2002 this is iron fan 7788 F009
5 14/5/2020 this ball is soft 9991 B099
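One caveat worth knowing with np.select: rows matching none of the conditions are filled with 0 unless you pass default=. A small sketch (the "plain water" row and the "unknown" fill value are illustrative, not from the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"descriptions": ["this is aPple", "plain water"]})
content = ["apple", "fan", "ball"]
condlist = [df.descriptions.str.lower().str.contains(w) for w in content]
choicelist = ["A001", "F009", "B099"]

# without default=..., the unmatched second row would become 0
df["category"] = np.select(condlist, choicelist, default="unknown")
print(df)
```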

Splitting a dataframe column on a pattern of characters and numerals

I have a dataframe that is:
A
1 king, crab, 2008
2 green, 2010
3 blue
4 green no. 4
5 green, house
I want to split the dates out into:
     A             B
1    king, crab    2008
2    green         2010
3    blue
4    green no. 4
5    green, house
I can't split on the first instance of ", " because that would make:
     A             B
1    king          crab, 2008
2    green         2010
3    blue
4    green no. 4
5    green         house
I can't split after the last instance of ", " because that would make:
     A             B
1    king, crab    2008
2    green         2010
3    blue
4    green no. 4
5    green         house
I also can't separate on numbers because that would make:
     A             B
1    king, crab    2008
2    green         2010
3    blue
4    green no.     4
5    green, house
Is there some way to split on ", " followed by a 4-digit number that is between two values? The two-values condition would be extra safety to filter out accidental 4-digit numbers that are clearly not years. For example, split by:
", " + (four-digit number between 1000 and 2021)
Also appreciated are answers that split by:
", " + four-digit number
Even better would be an answer that took into account that the number is ALWAYS at the end of the string.
Or you can just use series.str.extract and str.replace:
df = pd.DataFrame({"A": ["king, crab, 2008", "green, 2010", "blue", "green no. 4", "green, house"]})
df["year"] = df["A"].str.extract(r"(\d{4})")
df["A"] = df["A"].str.replace(r",\s\d{4}", "", regex=True)
print(df)
A year
0 king, crab 2008
1 green 2010
2 blue NaN
3 green no. 4 NaN
4 green, house NaN
import pandas as pd

list_dict_Input = [{'A': 'king, crab, 2008'},
                   {'A': 'green, 2010'},
                   {'A': 'green no. 4'},
                   {'A': 'green no. 4'}]
df = pd.DataFrame(list_dict_Input)

for row_Index in range(len(df)):
    text = df.iloc[row_Index]['A'].strip()
    last_4_Char = text[-4:]
    if last_4_Char.isdigit() and 1000 <= int(last_4_Char) <= 2021:
        df.at[row_Index, 'B'] = last_4_Char
print(df)
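The 1000–2021 range check can also be encoded directly in the regex and applied vectorized; a sketch, assuming (as the question states) that the year is always at the end of the string:

```python
import pandas as pd

df = pd.DataFrame({"A": ["king, crab, 2008", "green, 2010", "blue", "green no. 4", "green, house"]})

# 1000-1999 -> 1\d{3}; 2000-2019 -> 20[01]\d; 2020-2021 -> 202[01]
pat = r"^(?P<A>.*),\s(?P<B>1\d{3}|20[01]\d|202[01])$"
extracted = df["A"].str.extract(pat)
df["B"] = extracted["B"]
# rows with no trailing year keep their original A and get NaN in B
df["A"] = extracted["A"].fillna(df["A"])
print(df)
```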

Mapping values across dataframes to create a new one

I have two dataframes. The first represents the nutritional information of certain ingredients with ingredients as rows and the columns as the nutritional categories.
Item Brand and style Quantity Calories Total Fat ... Carbs Sugar Protein Fiber Sodium
0 Brown rice xxx xxxxxxxx xxxxx, long grain 150g 570 4.5 ... 1170 0 12 6 0
1 Whole wheat bread xxxxxxxx, whole grains 2 slices 220 4 ... 42 6 8 6 320
2 Whole wheat cereal xxx xxxxxxxx xxxxx, wheat squares 60g 220 1 ... 47 0 7 5 5
The second represents the type and quantity of ingredients of meals with the meals as rows and the ingredients as columns.
Meal Brown rice Whole wheat bread Whole wheat cereal ... Marinara sauce American cheese Olive oil Salt
0 Standard breakfast 0 0 1 ... 0 0 0 0
1 Standard lunch 0 2 0 ... 0 0 0 0
2 Standard dinner 0 0 0 ... 0 0 1 1
I am trying to create another dataframe such that the meals are rows and the nutritional categories are at the top, representing the entire nutritional value of the meal based on the number of ingredients.
For example, if a standard lunch consists of 2 slices of bread (150 calories each slice), 1 serving of peanut butter (100 calories), and 1 serving of jelly (50 calories), then I would like the dataframe to be like:
Meal Calories Total fat ...
Standard lunch 450 xxx
Standard dinner xxx xxx
...
450 comes from (2*150 + 100 + 50).
The function template could be:
def create_meal_category_dataframe(ingredients_df, meals_df):
    ingredients = meals_df.columns[1:]
    meals = meals_df['Meal']
    # return meal_cat_df
I extracted lists of the meal and ingredient names, but I'm not sure if they're useful here. Thanks.
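Since the meals table holds quantities and the ingredients table holds per-serving nutrition, the whole computation is one matrix product: (meals × ingredients) · (ingredients × nutritional categories). A sketch with toy data mirroring the lunch example (ingredient names and values are illustrative, and non-numeric columns like Brand and Quantity would need to be dropped first):

```python
import pandas as pd

ingredients_df = pd.DataFrame({
    "Item": ["Whole wheat bread", "Peanut butter", "Jelly"],
    "Calories": [150, 100, 50],
    "Protein": [8, 7, 0],
})
meals_df = pd.DataFrame({
    "Meal": ["Standard lunch"],
    "Whole wheat bread": [2],
    "Peanut butter": [1],
    "Jelly": [1],
})

def create_meal_category_dataframe(ingredients_df, meals_df):
    nutrition = ingredients_df.set_index("Item")    # ingredients x categories
    quantities = meals_df.set_index("Meal")         # meals x ingredients
    # align ingredient order, then matrix-multiply
    return quantities[nutrition.index].dot(nutrition).reset_index()

meal_cat_df = create_meal_category_dataframe(ingredients_df, meals_df)
print(meal_cat_df)  # Standard lunch: Calories = 2*150 + 100 + 50 = 450
```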

Generate columns of top ranked values in Pandas

I have a dataframe topic_data that contains the output of an LDA topic model:
topic_data.head(15)
topic word score
0 0 Automobile 0.063986
1 0 Vehicle 0.017457
2 0 Horsepower 0.015675
3 0 Engine 0.014857
4 0 Bicycle 0.013919
5 1 Sport 0.032938
6 1 Association_football 0.025324
7 1 Basketball 0.020949
8 1 Baseball 0.016935
9 1 National_Football_League 0.016597
10 2 Japan 0.051454
11 2 Beer 0.032839
12 2 Alcohol 0.027909
13 2 Drink 0.019494
14 2 Vodka 0.017908
This shows the top 5 terms for each topic, and the score (weight) for each. What I'm trying to do is reformat so that the index is the rank of the term, the columns are the topic IDs, and the values are formatted strings generated from the word and score columns (something along the lines of "%s (%.02f)" % (word,score)). That means the new dataframe should look something like this:
Topic 0 1 ...
Rank
0 Automobile (0.06) Sport (0.03) ...
1 Vehicle (0.017) Association_football (0.03) ...
... ... ... ...
What's the right way of going about this? I assume it involves a combination of index-setting, unstacking, and ranking, but I'm not sure of the right approach.
It would be something like this; note that Rank has to be generated first:
# rank within each topic by descending score
df['Rank'] = df.groupby('topic')['score'].rank(ascending=False, method='first').astype(int) - 1
df['New_str'] = df['word'] + df['score'].apply(' ({0:.2f})'.format)
df2 = df.sort_values(['Rank', 'score'])[['New_str', 'topic', 'Rank']]
print(df2.pivot(index='Rank', columns='topic', values='New_str'))
topic 0 1 2
Rank
0 Automobile (0.06) Sport (0.03) Japan (0.05)
1 Vehicle (0.02) Association_football (0.03) Beer (0.03)
2 Horsepower (0.02) Basketball (0.02) Alcohol (0.03)
3 Engine (0.01) Baseball (0.02) Drink (0.02)
4 Bicycle (0.01) National_Football_League (0.02) Vodka (0.02)
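A self-contained version of the same pivot, reconstructing a slice of topic_data from the question and using rank to number terms within each topic:

```python
import pandas as pd

df = pd.DataFrame({
    "topic": [0, 0, 1, 1],
    "word": ["Automobile", "Vehicle", "Sport", "Association_football"],
    "score": [0.063986, 0.017457, 0.032938, 0.025324],
})

# rank terms within each topic by descending score, zero-based
df["Rank"] = df.groupby("topic")["score"].rank(ascending=False, method="first").astype(int) - 1
df["New_str"] = df["word"] + df["score"].apply(" ({0:.2f})".format)

# rows become ranks, columns become topic IDs
wide = df.pivot(index="Rank", columns="topic", values="New_str")
print(wide)
```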
