Suppose I have a DataFrame that looks something like this:
    id country grade category  amount
0    7      fr     a    mango      52
1    5      fr     b   banana      68
2    7      fr     a   banana      73
3    4      it     c    mango      70
4    5      fr     b   banana      99
5    9      uk     a    apple      29
6    3      uk     a    mango      83
7    0      uk     b   banana      59
8    2      it     c    mango      11
9    9      uk     a   banana      91
10   0      uk     b    mango      95
11   8      uk     a    mango      30
12   3      uk     a    mango      82
13   1      it     b   banana      78
14   3      uk     a    apple      76
15   6      it     c    apple      76
16   2      it     c    mango      10
17   1      it     b    mango      30
18   9      uk     a   banana      17
19   2      it     c    mango      58
Each id belongs to a grade, lives in a country, and spends certain amounts on various fruits (category). Let's say the data covers a whole year. (The DataFrame is reproducible using the code below.)
import pandas as pd
df = pd.DataFrame({
"id":[7,5,7,4,5,9,3,0,2,9,0,8,3,1,3,6,2,1,9,2],
"country":["fr","fr","fr","it","fr","uk","uk","uk","it","uk","uk","uk","uk","it","uk","it","it","it","uk","it"],
"grade":["a","b","a","c","b","a","a","b","c","a","b","a","a","b","a","c","c","b","a","c"],
"category":["mango","banana","banana","mango","banana","apple","mango","banana","mango","banana","mango","mango","mango","banana","apple","apple","mango","mango","banana","mango"],
"amount":[52,68,73,70,99,29,83,59,11,91,95,30,82,78,76,76,10,30,17,58]
})
I would like to add two columns to this DF.
First, I'd like a column giving the mean annual (i.e. total) spend on each category by each combination of country and grade. So, for example, the Italy C-grade people have spent the following on mangos:
id: 4 total: 70
id: 2 total: 11 + 10 + 58 = 79
So the mean annual mango spend for Italy C-grade people is 74.5. I'd like to find this value for all of the country/grade/category combinations.
The second column I want to add is the same but for the mean annual count for each combination.
Desired output and the best I could come up with:
I've managed to populate these two desired columns using the following code:
import math
combos = [[i,j,k] for i in set(df["country"]) for j in set(df["grade"]) for k in set(df["category"])]
for c in combos:
    x = df.loc[(df["country"]==c[0])&(df["grade"]==c[1])&(df["category"]==c[2])]
    m = x.groupby("id").sum()["amount"].mean()
    k = x.groupby("id").count()["amount"].mean()
    if math.isnan(m):
        m = 0
    if math.isnan(k):
        k = 0
    c.append(m)
    c.append(k)
temp_grouping = pd.DataFrame(combos,columns=["country","grade","category","mean_totals","mean_counts"])
df = df.merge(temp_grouping,on=["country","grade","category"],how="left")
Which gives the desired output:
    id country grade category  amount  mean_totals  mean_counts
0    7      fr     a    mango      52         52            1
1    5      fr     b   banana      68        167            2
2    7      fr     a   banana      73         73            1
3    4      it     c    mango      70         74.5          2
4    5      fr     b   banana      99        167            2
5    9      uk     a    apple      29         52.5          1
6    3      uk     a    mango      83         97.5          1.5
7    0      uk     b   banana      59         59            1
8    2      it     c    mango      11         74.5          2
9    9      uk     a   banana      91        108            2
10   0      uk     b    mango      95         95            1
11   8      uk     a    mango      30         97.5          1.5
12   3      uk     a    mango      82         97.5          1.5
13   1      it     b   banana      78         78            1
14   3      uk     a    apple      76         52.5          1
15   6      it     c    apple      76         76            1
16   2      it     c    mango      10         74.5          2
17   1      it     b    mango      30         30            1
18   9      uk     a   banana      17        108            2
19   2      it     c    mango      58         74.5          2
The above code works, but it is not usable on my real data because it is pretty slow. I'm searching, therefore, for a faster/more efficient solution to my problem. Thanks very much.
You can create mean_totals column as follows:
mean_total_df = df.groupby(['country', 'category', 'grade']).apply(lambda x: x.amount.sum()/ x.id.nunique())
df['mean_totals'] = df.apply(lambda x: mean_total_df.loc[x.country, x.category, x.grade], axis=1)
which gives
0 7 fr a mango 52 52.0
1 5 fr b banana 68 167.0
2 7 fr a banana 73 73.0
3 4 it c mango 70 74.5
4 5 fr b banana 99 167.0
5 9 uk a apple 29 52.5
6 3 uk a mango 83 97.5
7 0 uk b banana 59 59.0
8 2 it c mango 11 74.5
9 9 uk a banana 91 108.0
10 0 uk b mango 95 95.0
11 8 uk a mango 30 97.5
12 3 uk a mango 82 97.5
13 1 it b banana 78 78.0
14 3 uk a apple 76 52.5
15 6 it c apple 76 76.0
16 2 it c mango 10 74.5
17 1 it b mango 30 30.0
18 9 uk a banana 17 108.0
19 2 it c mango 58 74.5
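The row-wise df.apply above can itself be slow on large frames. As a sketch of a fully vectorized alternative that produces both requested columns at once (the names mean_totals and mean_counts are taken from the question), you can compute per-id totals and counts first, average them per combination, and merge back:

```python
import pandas as pd

# minimal reproduction of the Italy / c-grade / mango rows from the question
df = pd.DataFrame({
    "id":       [4, 2, 2, 2],
    "country":  ["it", "it", "it", "it"],
    "grade":    ["c", "c", "c", "c"],
    "category": ["mango", "mango", "mango", "mango"],
    "amount":   [70, 11, 10, 58],
})

keys = ["country", "grade", "category"]

# per-id annual total and purchase count within each combination
per_id = (df.groupby(keys + ["id"])["amount"]
            .agg(total="sum", n="count")
            .reset_index())

# average the per-id figures over the ids of each combination
means = (per_id.groupby(keys)[["total", "n"]]
               .mean()
               .rename(columns={"total": "mean_totals", "n": "mean_counts"})
               .reset_index())

df = df.merge(means, on=keys, how="left")
print(df[["id", "mean_totals", "mean_counts"]])
```

This avoids any Python-level loop over combinations, and combinations with no rows simply never appear, so no NaN patching is needed.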
It looks like you need a double groupby. Once for the sum, once for the mean:
out = (df
.groupby(['country', 'grade', 'category', 'id']).sum()
.groupby(['country', 'grade', 'category']).mean()
)
output:
amount
country grade category
fr a banana 73.0
mango 52.0
b banana 167.0
it b banana 78.0
mango 30.0
c apple 76.0
mango 74.5
uk a apple 52.5
banana 108.0
mango 97.5
b banana 59.0
mango 95.0
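If you also want those means broadcast back onto the original rows, one sketch (the column name mean_totals is an assumption matching the question) is to name the result, reset its index, and merge on the three keys:

```python
import pandas as pd

# small sample in the shape of the question's data
df = pd.DataFrame({
    "id":       [4, 2, 2, 2, 7],
    "country":  ["it", "it", "it", "it", "fr"],
    "grade":    ["c", "c", "c", "c", "a"],
    "category": ["mango", "mango", "mango", "mango", "mango"],
    "amount":   [70, 11, 10, 58, 52],
})

# double groupby: per-id sums, then the mean of those sums per combination
out = (df
       .groupby(['country', 'grade', 'category', 'id'])['amount'].sum()
       .groupby(['country', 'grade', 'category']).mean()
       .rename('mean_totals')
       .reset_index())

df = df.merge(out, on=['country', 'grade', 'category'], how='left')
```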
This should be fast: first group and compute the required aggregates, then merge the result back into the existing df.
import pandas as pd
df = pd.DataFrame({
"id":[7,5,7,4,5,9,3,0,2,9,0,8,3,1,3,6,2,1,9,2],
"country":["fr","fr","fr","it","fr","uk","uk","uk","it","uk","uk","uk","uk","it","uk","it","it","it","uk","it"],
"grade":["a","b","a","c","b","a","a","b","c","a","b","a","a","b","a","c","c","b","a","c"],
"category":["mango","banana","banana","mango","banana","apple","mango","banana","mango","banana","mango","mango","mango","banana","apple","apple","mango","mango","banana","mango"],
"amount":[52,68,73,70,99,29,83,59,11,91,95,30,82,78,76,76,10,30,17,58]
})
intermediate_df = (
    df.groupby(by=['country', 'grade', 'category', 'id'], as_index=False)
      .agg(int_totals=pd.NamedAgg(column='amount', aggfunc='sum'),
           int_counts=pd.NamedAgg(column='id', aggfunc='count'))
      .groupby(by=['country', 'grade', 'category'], as_index=False)
      .agg(mean_totals=pd.NamedAgg(column='int_totals', aggfunc='mean'),
           mean_counts=pd.NamedAgg(column='int_counts', aggfunc='mean'))
)
output_df = pd.merge(df, intermediate_df, on=['country', 'grade', 'category'], how='left')
print(output_df)
Is there a way to modify the expanding windows in Pandas? For example, consider a random DF:
df = pd.DataFrame(np.random.randint(0,60,size=(10,3)),columns=["a","b","c"])
df["d1"]=["Apple","Mango","Apple","Apple","Mango","Mango","Apple","Mango","Apple","Apple"]
df["d2"]=["Orange","lemon","lemon","Orange","lemon","Orange","lemon","Orange","lemon","Orange"]
df["date"] = ["2002-01-01","2002-01-01","2002-01-01","2002-01-01","2002-01-01","2002-01-01","2002-01-01","2002-02-01","2002-02-01","2002-02-01"]
df["date"] = pd.to_datetime(df["date"])
df
a b c d1 d2 date
0 16 25 37 Apple Orange 2002-01-01
1 24 41 32 Mango lemon 2002-01-01
2 41 20 53 Apple lemon 2002-01-01
3 4 28 47 Apple Orange 2002-01-01
4 7 29 10 Mango lemon 2002-01-01
5 6 54 15 Mango Orange 2002-01-01
6 26 54 35 Apple lemon 2002-01-01
7 31 4 12 Mango Orange 2002-02-01
8 33 36 54 Apple lemon 2002-02-01
9 50 22 48 Apple Orange 2002-02-01
When I do df.expanding(1).apply(), it applies the function over windows that grow one row at a time. Is it possible to pass the date column to the expanding function so that, instead of growing by one row per window, it accumulates groups of rows based on date?
Existing expanding window:
window 1: 0 16 25 37 Apple Orange 2002-01-01
window 2: 0 16 25 37 Apple Orange 2002-01-01
1 24 41 32 Mango lemon 2002-01-01
window 3: 0 16 25 37 Apple Orange 2002-01-01
1 24 41 32 Mango lemon 2002-01-01
2 41 20 53 Apple lemon 2002-01-01
Expected expanding window:
window 1 (all rows for date "2002-01-01"):
0 16 25 37 Apple Orange 2002-01-01
1 24 41 32 Mango lemon 2002-01-01
2 41 20 53 Apple lemon 2002-01-01
3 4 28 47 Apple Orange 2002-01-01
4 7 29 10 Mango lemon 2002-01-01
5 6 54 15 Mango Orange 2002-01-01
6 26 54 35 Apple lemon 2002-01-01
window 2 (all rows for date "2002-01-01" and "2002-02-01" ):
0 16 25 37 Apple Orange 2002-01-01
1 24 41 32 Mango lemon 2002-01-01
2 41 20 53 Apple lemon 2002-01-01
3 4 28 47 Apple Orange 2002-01-01
4 7 29 10 Mango lemon 2002-01-01
5 6 54 15 Mango Orange 2002-01-01
6 26 54 35 Apple lemon 2002-01-01
7 31 4 12 Mango Orange 2002-02-01
8 33 36 54 Apple lemon 2002-02-01
9 50 22 48 Apple Orange 2002-02-01
This assumes df.date is sorted.
I suppose there is a more efficient way to calculate end; if you find a better solution, please let me know.
Related: another usage of custom window rolling in Pandas, and some possible problems with it.
import pandas as pd
import numpy as np
from pandas.api.indexers import BaseIndexer
from typing import Optional, Tuple
class CustomIndexer(BaseIndexer):
    def get_window_bounds(self,
                          num_values: int = 0,
                          min_periods: Optional[int] = None,
                          center: Optional[bool] = None,
                          closed: Optional[str] = None
                          ) -> Tuple[np.ndarray, np.ndarray]:
        lst = []
        prev_date = self.custom_name[0]
        for i, date in enumerate(self.custom_name[1:], 1):
            if prev_date != date:
                lst.append(i)
                prev_date = date
        lst.append(len(df))
        end = np.array(lst)
        start = np.zeros_like(end)
        return start, end
indexer = CustomIndexer(custom_name=df.date)
for window in df.rolling(indexer):
print(window)
Outputs:
a b c d1 d2 date
0 17 27 35 Apple Orange 2002-01-01
1 39 10 57 Mango lemon 2002-01-01
2 8 31 12 Apple lemon 2002-01-01
3 20 17 23 Apple Orange 2002-01-01
4 11 26 41 Mango lemon 2002-01-01
5 52 57 9 Mango Orange 2002-01-01
6 40 15 33 Apple lemon 2002-01-01
a b c d1 d2 date
0 17 27 35 Apple Orange 2002-01-01
1 39 10 57 Mango lemon 2002-01-01
2 8 31 12 Apple lemon 2002-01-01
3 20 17 23 Apple Orange 2002-01-01
4 11 26 41 Mango lemon 2002-01-01
5 52 57 9 Mango Orange 2002-01-01
6 40 15 33 Apple lemon 2002-01-01
7 17 17 37 Mango Orange 2002-02-01
8 5 53 0 Apple lemon 2002-02-01
9 16 10 24 Apple Orange 2002-02-01
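Regarding the "more efficient way to calculate end": one loop-free sketch (assuming, as above, that df.date is sorted) uses np.searchsorted against the unique dates, so each window ends just past the last row of its date:

```python
import numpy as np
import pandas as pd

# dates in the shape of the question: 7 rows in January, 3 in February
dates = pd.to_datetime(pd.Series(
    ["2002-01-01"] * 7 + ["2002-02-01"] * 3))

# for each distinct date, the index just past its last row
end = np.searchsorted(dates.to_numpy(), dates.unique(), side="right")
# every expanding window starts at row 0
start = np.zeros_like(end)
```

These start/end arrays could then be returned from get_window_bounds in place of the Python loop.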
Fruit January Shipments January Sales February Shipments February Sales
------------ ------------------- --------------- -------------------- ----------------
Apple 30 11 18 31
Banana 12 49 39 14
Pear 25 50 44 21
Kiwi 41 25 10 25
Strawberry 11 33 35 50
I'm trying to achieve the following result:
Fruit Month Shipments Sales
------------ ---------- ----------- -------
Apple January 30 11
Banana January 12 49
Pear January 25 50
Kiwi January 41 25
Strawberry January 11 33
Apple February 18 31
Banana February 39 14
Pear February 44 21
Kiwi February 10 25
Strawberry February 35 50
I've tried pandas.pivot and pandas.pivot_table and had no luck. I'm in the process of creating two dataframes (Fruit/Month/Shipments and Fruit/Month/Sales) and concatenating the two into one with a loop, but I was hoping for an easier way to do this.
One way is to modify the columns to a MultiIndex and then use stack. Let's suppose your dataframe is called df. First set the column Fruit as the index, then define the multilevel columns:
df = df.set_index('Fruit')
# manual way to create the multiindex columns
#df.columns = pd.MultiIndex.from_product([['January','February'],
# ['Shipments','Sales']], names=['Month',None])
# more general way to create the multiindex columns thanks to #Scott Boston
df.columns = df.columns.str.split(expand=True)
df.columns.names = ['Month',None]
Your data now looks like:
Month January February
Shipments Sales Shipments Sales
Fruit
Apple 30 11 18 31
Banana 12 49 39 14
Pear 25 50 44 21
Kiwi 41 25 10 25
Strawberry 11 33 35 50
Now you can use stack on level 0 and reset_index
df_output = df.stack(0).reset_index()
which gives
Fruit Month Sales Shipments
0 Apple February 31 18
1 Apple January 11 30
2 Banana February 14 39
3 Banana January 49 12
4 Pear February 21 44
5 Pear January 50 25
6 Kiwi February 25 10
7 Kiwi January 25 41
8 Strawberry February 50 35
9 Strawberry January 33 11
Finally, if you want a specific order for values in the column Month you can use pd.Categorical:
df_output['Month'] = pd.Categorical(df_output['Month'].tolist(), ordered=True,
categories=['January','February'])
which specifies that January comes before February when sorting. Now, doing
df_output = df_output.sort_values(['Month'])
gives the result:
Fruit Month Sales Shipments
1 Apple January 11 30
3 Banana January 49 12
5 Pear January 50 25
7 Kiwi January 25 41
9 Strawberry January 33 11
0 Apple February 31 18
2 Banana February 14 39
4 Pear February 21 44
6 Kiwi February 25 10
8 Strawberry February 50 35
I see it's not exactly the expected output (the order of the Fruit column and the order of the columns differ), but both can be easily changed if needed.
Here is how to use pd.wide_to_long, as #user3483203 suggests.
df1 = df.set_index('Fruit')
#First we rename the columns: split into MultiIndex headers, then swap and join the levels.
df1.columns = df1.columns.str.split(expand=True)
df1.columns = df1.columns.map('{0[1]}_{0[0]}'.format)
#Reset index and use pd.wide_to_long:
df1 = df1.reset_index()
df_out = pd.wide_to_long(df1, ['Shipments','Sales'], 'Fruit', 'Month','_','\w+')\
.reset_index()
print(df_out)
Output:
Fruit Month Shipments Sales
0 Apple January 30.0 11.0
1 Banana January 12.0 49.0
2 Pear January 25.0 50.0
3 Kiwi January 41.0 25.0
4 Strawberry January 11.0 33.0
5 Apple February 18.0 31.0
6 Banana February 39.0 14.0
7 Pear February 44.0 21.0
8 Kiwi February 10.0 25.0
9 Strawberry February 35.0 50.0
I'm having trouble with using the stack() function on a section of a dataframe in pandas and then merging that stacked data back into the original dataframe.
To explain more understandably through an example, suppose I have the following df:
>>>df
date name favorite_color day_1 day_2 day_3 day_4 count
0 1/9/2018 Tom Blue 27 28 45 30 14
1 1/10/2018 Stan Red 29 13 16 5 13
2 1/11/2018 Rob Green 18 7 3 4 21
I want to "stack" the columns that start with 'day' and to do so I created a separate temporary dataframe with just those columns, and then stacked them via stack()
temp_df = df.loc[:,['day_1','day_2','day_3','day_4', 'count']]
temp_df = temp_df.stack() # this is now a Series, NOT a DataFrame
print(temp_df)
0 day_1 27
day_2 28
day_3 45
day_4 30
count 14
1 day_1 29
day_2 13
day_3 16
day_4 5
count 13
2 day_1 18
day_2 7
day_3 3
day_4 4
count 21
What I would like to do now, which I can't seem to figure out and would really appreciate some help on, is merge this Series of stacked data back into the original dataframe so that I get the following:
>>>final_df
date name favorite_color time_frame value
0 1/9/2018 Tom Blue day_1 27
1 1/9/2018 Tom Blue day_2 28
2 1/9/2018 Tom Blue day_3 45
3 1/9/2018 Tom Blue day_4 30
4 1/9/2018 Tom Blue count 14
5 1/10/2018 Stan Red day_1 29
6 1/10/2018 Stan Red day_2 13
7 1/10/2018 Stan Red day_3 16
8 1/10/2018 Stan Red day_4 5
9 1/10/2018 Stan Red count 13
10 1/11/2018 Rob Green day_1 18
11 1/11/2018 Rob Green day_2 7
12 1/11/2018 Rob Green day_3 3
13 1/11/2018 Rob Green day_4 4
14 1/11/2018 Rob Green count 21
Any pointers on this or suggestions for a better approach entirely would be greatly appreciated!
IIUC, you need wide_to_long:
pd.wide_to_long(df,'day',i=['date','name','favorite_color'],j='days',sep='_').\
rename(columns={'day':'value'}).\
reset_index()
Out[1002]:
date name favorite_color days value
0 1/9/2018 Tom Blue 1 27
1 1/9/2018 Tom Blue 2 28
2 1/9/2018 Tom Blue 3 45
3 1/9/2018 Tom Blue 4 30
4 1/10/2018 Stan Red 1 29
5 1/10/2018 Stan Red 2 13
6 1/10/2018 Stan Red 3 16
7 1/10/2018 Stan Red 4 5
8 1/11/2018 Rob Green 1 18
9 1/11/2018 Rob Green 2 7
10 1/11/2018 Rob Green 3 3
11 1/11/2018 Rob Green 4 4
Update
tempdf = df.drop('count', axis=1)
df1=pd.wide_to_long(tempdf,'day',i=['date','name','favorite_color'],j='days',sep='_').\
rename(columns={'day':'value'}).\
reset_index()
df2 = (df.set_index(['date','name','favorite_color'])[['count']]
         .stack()
         .reset_index()
         .rename(columns={'level_3':'days', 0:'value'}))
pd.concat([df1,df2])
Out[24]:
date name favorite_color days value
0 1/9/2018 Tom Blue 1 27
1 1/9/2018 Tom Blue 2 28
2 1/9/2018 Tom Blue 3 45
3 1/9/2018 Tom Blue 4 30
4 1/10/2018 Stan Red 1 29
5 1/10/2018 Stan Red 2 13
6 1/10/2018 Stan Red 3 16
7 1/10/2018 Stan Red 4 5
8 1/11/2018 Rob Green 1 18
9 1/11/2018 Rob Green 2 7
10 1/11/2018 Rob Green 3 3
11 1/11/2018 Rob Green 4 4
0 1/9/2018 Tom Blue count 14
1 1/10/2018 Stan Red count 13
2 1/11/2018 Rob Green count 21
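As an alternative to the wide_to_long/concat route above, a single melt can handle the day_* columns and count in one pass and reproduce the exact row order of the desired final_df (this is a sketch, not the answer above; the time_frame/value names come from the question):

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["1/9/2018", "1/10/2018", "1/11/2018"],
    "name": ["Tom", "Stan", "Rob"],
    "favorite_color": ["Blue", "Red", "Green"],
    "day_1": [27, 29, 18],
    "day_2": [28, 13, 7],
    "day_3": [45, 16, 3],
    "day_4": [30, 5, 4],
    "count": [14, 13, 21],
})

# melt everything that isn't an id column, then restore the original
# row order via the saved index (stable sort keeps column order within rows)
final_df = (df.reset_index()
              .melt(id_vars=["index", "date", "name", "favorite_color"],
                    var_name="time_frame", value_name="value")
              .sort_values("index", kind="stable")
              .drop(columns="index")
              .reset_index(drop=True))
print(final_df)
```

Unlike wide_to_long, melt needs no naming convention in the column headers, which is why count can come along without special handling.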