How to write to the cell below level 0 of a multiindex - python

I've got a df with a MultiIndex like so
nums = np.arange(5)
key = ['kfc'] * 5
mi = pd.MultiIndex.from_arrays([key,nums])
df = pd.DataFrame({'rent': np.arange(10,60,10)})
df = df.set_index(mi)
       rent
kfc 0    10
    1    20
    2    30
    3    40
    4    50
How can I write to the cell below kfc? I want to add meta info, e.g. the address or the monthly rent:
       rent
kfc 0    10
NYC 1    20
    2    30
    3    40
    4    50

According to your expected output you would need to recreate the df MultiIndex:
df.index = pd.MultiIndex.from_tuples(zip(['kfc'] + ['NYC'] * 4, df.index.levels[1]))
print(df)
       rent
kfc 0    10
NYC 1    20
    2    30
    3    40
    4    50
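If you are rebuilding the frame anyway, you can also construct the level-0 labels directly; a minimal sketch of the same idea (the 'NYC' label is just the example meta info from the question):

import numpy as np
import pandas as pd

nums = np.arange(5)
key = ['kfc'] + ['NYC'] * 4   # level-0 labels matching the expected output
df = pd.DataFrame({'rent': np.arange(10, 60, 10)},
                  index=pd.MultiIndex.from_arrays([key, nums]))
print(df)

Note that either way 'NYC' simply becomes part of the row labels; it is not stored as a separate metadata field.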

How to merge dataframe between dates

I have one dataframe that contains daily sales data (DF).
I have another dataframe that contains quarterly data (DF1).
This is what the quarterly dataframe, DF1, looks like:
Date Computer Sale In Person Sales Net Sales
1/29/2021 1 2 3
4/30/2021 2 4 6
7/29/2021 3 6 9
1/29/2022 4 8 12
5/1/2022 5 10 15
7/30/2022 6 12 18
This is what the daily dataframe (DF) looks like:
Date       Num of people
1/30/2021  45
1/31/2021  35
2/1/2021   25
5/1/2021   20
5/2/2021   15
I have columns Computer Sales, In Person Sales, Net Sales in the quarterly dataframe.
How do I merge the columns from above into the daily dataframe so that I can see the quarterly data on the daily dataframe? I want the final result to look like this:
Date Num of people Computer Sale In Person Sales Net Sales
1/30/2021 45 1 2 3
1/31/2021 35 1 2 3
2/1/2021 25 1 2 3
5/1/2021 20 2 4 6
5/2/2021 15 2 4 6
So, for example, I want 1/30/2021 to use the figures from 1/29/2021, and once the daily data goes past 4/30/2021, merge in the next quarter's data.
Please let me know if I need to be more specific.
A possible solution:
# df1 is the quarterly frame (DF1), df2 is the daily frame (DF); both must be sorted by Date
df1['Date'] = pd.to_datetime(df1['Date'])
df2['Date'] = pd.to_datetime(df2['Date'])
# for each daily date, take the most recent quarterly row at or before it
pd.merge_asof(df2, df1, on='Date', direction='backward')
Output:
Date Num of people Computer Sale In Person Sales Net Sales
0 2021-01-30 45 1 2 3
1 2021-01-31 35 1 2 3
2 2021-02-01 25 1 2 3
3 2021-05-01 20 2 4 6
4 2021-05-02 15 2 4 6
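For reference, a self-contained sketch of the same approach, using a subset of the question's rows (df1 is the quarterly frame DF1 and df2 the daily frame DF, as in the snippet above):

import pandas as pd

df1 = pd.DataFrame({'Date': ['1/29/2021', '4/30/2021'],
                    'Computer Sale': [1, 2],
                    'In Person Sales': [2, 4],
                    'Net Sales': [3, 6]})
df2 = pd.DataFrame({'Date': ['1/30/2021', '1/31/2021', '5/1/2021'],
                    'Num of people': [45, 35, 20]})

df1['Date'] = pd.to_datetime(df1['Date'])
df2['Date'] = pd.to_datetime(df2['Date'])

# merge_asof needs both frames sorted on the key; direction='backward'
# attaches the most recent quarterly row at or before each daily date
out = pd.merge_asof(df2.sort_values('Date'), df1.sort_values('Date'),
                    on='Date', direction='backward')
print(out)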

Calculate the amount spent per month depending on another column value, TYPE_ID

I am trying to get the amount spent for each TYPE_ID, broken down by the month column.
Dataset:
ID TYPE_ID Month_year Amount
100 1 jun_2019 20
100 1 jul_2019 30
100 2 jun_2019 10
200 1 jun_2019 50
200 1 jun_2019 30
100 2 jul_2019 20
200 2 jun_2019 40
200 2 jul_2019 10
200 2 jun_2019 20
200 1 jul_2019 30
100 1 jul_2019 10
Output:
For every TYPE_ID, I want to calculate the spend per month. The column TYPEID_1_jun2019 tells me the number of transactions done in that particular month, and Amount_type1_jun2019 tells me the total amount spent in that month for that TYPE_ID.
ID TYPEID_1_jun2019 Amount_type1_jun2019 TYPEID_1_jul2019 Amount_type1_jul2019 TYPEID_2_jun2019 Amount_type2_jun2019 TYPEID_2_jul2019 Amount_type2_jul2019
100 1 20 2 40 1 10 1 20
200 1 80 1 30 2 60 1 10
EDIT: I also want to calculate the average monthly spend for every ID.
Output: also include these columns,
ID Average_type1_jul2019 Average_type1_jun2019
100 20 10
The formula I used to calculate the average is the amount spent in July with TYPE_ID 1 divided by the total number of months.
First convert Month_year to datetimes for correct ordering, then create a helper column type and aggregate sum together with size, reshape with DataFrame.unstack, sort with DataFrame.sort_index, and finally flatten the MultiIndex, converting the datetimes back to the original format:
df['Month_year'] = pd.to_datetime(df['Month_year'], format='%b_%Y')
df1 = (df.assign(type=df['TYPE_ID']).groupby(['ID','Month_year','TYPE_ID'])
.agg({'Amount':'sum', 'type':'size'})
.unstack([1,2])
.sort_index(axis=1, level=[1,2]))
df1.columns = df1.columns.map(lambda x: f'{x[0]}_{x[2]}_{x[1].strftime("%b_%Y")}')
df1 = df1.reset_index()
print (df1)
ID Amount_1_Jun_2019 type_1_Jun_2019 Amount_2_Jun_2019 \
0 100 20 1 10
1 200 80 2 60
type_2_Jun_2019 Amount_1_Jul_2019 type_1_Jul_2019 Amount_2_Jul_2019 \
0 1 40 2 20
1 2 30 1 10
type_2_Jul_2019
0 1
1 1
EDIT:
# removed sorting and flattening of the MultiIndex
df['Month_year'] = pd.to_datetime(df['Month_year'], format='%b_%Y')
df1 = (df.assign(type=df['TYPE_ID']).groupby(['ID','Month_year','TYPE_ID'])
.agg({'Amount':'sum', 'type':'size'})
.unstack([1,2]))
print (df1)
Amount type
Month_year 2019-06-01 2019-07-01 2019-06-01 2019-07-01
TYPE_ID 1 2 1 2 1 2 1 2
ID
100 20 10 40 20 1 1 2 1
200 80 60 30 10 2 2 1 1
# get the number of unique Month_year values per ID and type, and divide the Amount sums by it
df2 = df.groupby(['ID','TYPE_ID'])['Month_year'].nunique().unstack()
df3 = df1.xs('Amount', axis=1, level=0).div(df2, level=1)
#added top level Average
df3.columns = pd.MultiIndex.from_tuples([('Average', a, b) for a, b in df3.columns])
print (df3)
Average
2019-06-01 2019-07-01
1 2 1 2
ID
100 10.0 5.0 20.0 10.0
200 40.0 30.0 15.0 5.0
# join together, sort and flatten the MultiIndex
df5 = pd.concat([df1, df3],axis=1).sort_index(axis=1, level=[1,2])
df5.columns = df5.columns.map(lambda x: f'{x[0]}_{x[2]}_{x[1].strftime("%b_%Y")}')
df5 = df5.reset_index()
print (df5)
ID Amount_1_Jun_2019 Average_1_Jun_2019 type_1_Jun_2019 \
0 100 20 10.0 1
1 200 80 40.0 2
Amount_2_Jun_2019 Average_2_Jun_2019 type_2_Jun_2019 Amount_1_Jul_2019 \
0 10 5.0 1 40
1 60 30.0 2 30
Average_1_Jul_2019 type_1_Jul_2019 Amount_2_Jul_2019 Average_2_Jul_2019 \
0 20.0 2 20 10.0
1 15.0 1 10 5.0
type_2_Jul_2019
0 1
1 1
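As a rough alternative along the same lines, pivot_table can produce the per-month sums and transaction counts in one call; a sketch on a few of the question's rows (the flattened column names here are illustrative, not exactly the ones the OP asked for):

import pandas as pd

df = pd.DataFrame({'ID': [100, 100, 100, 200, 200],
                   'TYPE_ID': [1, 1, 2, 1, 1],
                   'Month_year': ['jun_2019', 'jul_2019', 'jun_2019', 'jun_2019', 'jun_2019'],
                   'Amount': [20, 30, 10, 50, 30]})

out = df.pivot_table(index='ID', columns=['TYPE_ID', 'Month_year'],
                     values='Amount', aggfunc=['sum', 'count'], fill_value=0)
# columns come out as (aggfunc, TYPE_ID, Month_year); flatten them
out.columns = [f'{agg}_type{t}_{m}' for agg, t, m in out.columns]
print(out.reset_index())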

Pandas calculate aggregate value with respect to the current row

Let's say we have this data:
df = pd.DataFrame({
'group_id': [100,100,100,101,101,101,101],
'amount': [30,40,10,20,25,80,40]
})
df.index.name = 'id'
df.set_index(['group_id', df.index], inplace=True)
It looks like this:
             amount
group_id id
100      0       30
         1       40
         2       10
101      3       20
         4       25
         5       80
         6       40
The goal is to compute a new column that is the sum of all amounts in the same group that are less than the current one, i.e. we want this result:
             amount  sum_of_smaller_amounts
group_id id
100      0       30                      10
         1       40                      40   # 30 + 10
         2       10                       0   # smallest amount
101      3       20                       0   # smallest
         4       25                      20
         5       80                      85   # 20 + 25 + 40
         6       40                      45   # 20 + 25
Ideally this should be (very) efficient as the real dataframe could be millions of rows.
Better solution (I think):
df_sort = df.sort_values('amount')
df['sum_smaller_amount'] = (df_sort.groupby('group_id')['amount']
                                   .transform(lambda x: x.mask(x.duplicated(), 0).cumsum()) -
                            df['amount'])
Output:
amount sum_smaller_amount
group_id id
100 0 30 10.0
1 40 40.0
2 10 0.0
101 3 20 0.0
4 25 20.0
5 80 85.0
6 40 45.0
Another way to do this is to use a cartesian product and filter:
df.merge(df.reset_index(), on='group_id', suffixes=('_sum_smaller',''))\
.query('amount_sum_smaller < amount')\
.groupby(['group_id','id'])[['amount_sum_smaller']].sum()\
.join(df, how='right').fillna(0)
Output:
amount_sum_smaller amount
group_id id
100 0 10.0 30
1 40.0 40
2 0.0 10
101 3 0.0 20
4 20.0 25
5 85.0 80
6 45.0 40
You want sort_values and cumsum:
df['new_amount'] = (df.sort_values('amount')
                      .groupby(level='group_id')['amount']
                      .cumsum() - df['amount'])
Output:
amount new_amount
group_id id
100 0 30 10
1 40 40
2 10 0
101 3 20 0
4 25 20
5 80 85
6 40 45
Update: fix for repeated values:
# the data
df = pd.DataFrame({
'group_id': [100,100,100,100,101,101,101,101],
'amount': [30,40,10,30,20,25,80,40]
})
df.index.name = 'id'
df.set_index(['group_id', df.index], inplace=True)
# sort values:
df_sorted = df.sort_values('amount')
# cumsum
s1 = df_sorted.groupby('group_id')['amount'].cumsum()
# value counts
s2 = df_sorted.groupby(['group_id', 'amount']).cumcount() + 1
# instead of just subtracting df['amount'], we subtract amount * counts
df['new_amount'] = s1 - df['amount'].mul(s2)
Output (note the two values of 30 in group 100):
amount new_amount
group_id id
100 0 30 10
1 40 70
2 10 0
3 30 10
101 4 20 0
5 25 20
6 80 85
7 40 45
I'm intermediate on pandas and not sure about efficiency, but here's a solution:
temp_df = df.sort_values(['group_id','amount'])
temp_df = temp_df.mask(temp_df['amount'] == temp_df['amount'].shift(), other=0).groupby(level='group_id').cumsum()
df['sum'] = temp_df.sort_index(level='id')['amount'] - df['amount']
Result:
amount sum
group_id id
100 0 30 10
1 40 40
2 10 0
101 3 20 0
4 25 20
5 80 85
6 40 45
7 40 45
You can substitute the last line with these if they help efficiency somehow:
df['sum'] = df.subtract(temp_df).multiply(-1)
# or
df['sum'] = (~df).add(temp_df + 1)
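For reference, a brute-force sketch that states the definition directly; it is useful for checking the vectorized answers on small frames, but far too slow for millions of rows:

import pandas as pd

df = pd.DataFrame({
    'group_id': [100, 100, 100, 101, 101, 101, 101],
    'amount':   [30, 40, 10, 20, 25, 80, 40]
})
df.index.name = 'id'
df = df.set_index(['group_id', df.index])

# for each row, sum the amounts in its group that are strictly smaller
df['sum_smaller_check'] = df.groupby(level='group_id')['amount'].transform(
    lambda s: s.map(lambda v: s[s < v].sum()))
print(df)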

pandas get average of a groupby

I am trying to find the average monthly cost per user_id, but I am only able to get the average cost per user or the monthly cost per user.
Because I group by user and month, there is no way to get the average of the second groupby (month) unless I transform the groupby output to something else.
This is my df:
df = pd.DataFrame({'id':   pd.Series([1,1,1,1,2,2,2,2]),
                   'cost': pd.Series([10,20,30,40,50,60,70,80]),
                   'mth':  pd.Series([3,3,4,5,3,4,4,5])})
cost id mth
0 10 1 3
1 20 1 3
2 30 1 4
3 40 1 5
4 50 2 3
5 60 2 4
6 70 2 4
7 80 2 5
I can get the monthly sum, but I want the average over the months for each user_id.
df.groupby(['id','mth'])['cost'].sum()
id  mth
1   3       30
    4       30
    5       40
2   3       50
    4      130
    5       80
I want something like this:
id average_monthly
1 (30+30+40)/3
2 (50+130+80)/3
Resetting the index should work. Try this:
In [19]: df.groupby(['id', 'mth']).sum().reset_index().groupby('id').mean()
Out[19]:
mth cost
id
1 4.0 33.333333
2 4.0 86.666667
You can just drop mth if you want. The logic is that after the sum part, you have this:
In [20]: df.groupby(['id', 'mth']).sum()
Out[20]:
         cost
id mth
1  3       30
   4       30
   5       40
2  3       50
   4      130
   5       80
Resetting the index at this point gives you one row per unique (id, mth) pair.
In [21]: df.groupby(['id', 'mth']).sum().reset_index()
Out[21]:
id mth cost
0 1 3 30
1 1 4 30
2 1 5 40
3 2 3 50
4 2 4 130
5 2 5 80
It's just a matter of grouping it again, this time using mean instead of sum. This should give you the averages.
Let us know if this helps.
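Equivalently, because the summed result keeps id as an index level, you can group the Series again without resetting the index; a compact sketch of the same sum-then-mean idea:

import pandas as pd

df = pd.DataFrame({'id':   [1, 1, 1, 1, 2, 2, 2, 2],
                   'cost': [10, 20, 30, 40, 50, 60, 70, 80],
                   'mth':  [3, 3, 4, 5, 3, 4, 4, 5]})

# sum per (id, mth), then average those monthly sums per id
avg = (df.groupby(['id', 'mth'])['cost'].sum()
         .groupby('id').mean()
         .rename('average_monthly')
         .reset_index())
print(avg)
#    id  average_monthly
# 0   1        33.333333
# 1   2        86.666667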
The same sum-then-mean pattern applied to differently named columns (note that the second groupby should be on the key you want the averages for, e.g. the year, not on the summed Revenue column):
df_monthly_average = (
    df.groupby(["InvoiceMonth", "InvoiceYear"])["Revenue"]
    .sum()
    .reset_index()
    .groupby("InvoiceYear")
    .mean()
    .reset_index()
)

How to get top 2 per multi index in Pandas dataframe (generated by pivot_table)

I would like to show the top 2 results for the first 2 levels of a 3-level indexed dataframe (created with pivot_table).
import pandas as pd
df = pd.DataFrame([[2015,1,'A','R1',70],
[2015,2,'B','R2',40],
[2015,3,'C','R3',20],
[2015,1,'D','R2',90],
[2015,2,'A','R1',30],
[2015,3,'A','R3',20],
[2015,1,'B','R2',50],
[2015,2,'C','R1',90],
[2015,3,'B','R3',10],
[2015,1,'C','R3',10]],
columns = ['year','month','profile','ranking','sales'])
# create a pivot that sums the sales, sorts the profiles by total sales per year, month and profile
df.pivot_table(values = 'sales',
index = ['year','month','profile'],
columns = ['ranking'],
aggfunc = 'sum',
fill_value = 0,
margins = True).sort_values(by = 'All',ascending = False).sort_index(level=[0,1], sort_remaining=False)
Question 1: how do I get only the top two profiles per year/month combination?
so
for: 2015,1: D & A
for: 2015,2: C & B
for: 2015,3: A & C
Bonus question:
How to get the sums for the non-top-2 profiles and call them 'Other'?
so
for: 2015,1: Other,0,50,10,60 (which is the sum of B&C)
for: 2015,2: Other,30,0,0,30 (which is A only in this case)
for: 2015,3: Other,0,0,10,10 (which is B only in this case)
I would like to have it returned to me as a dataframe.
UPDATE:
without pivoting:
In [120]: srt = df.sort_values(['year','month','profile'])
In [123]: srt[srt.groupby(['year','month'])['profile'].rank(method='min') <= 2]
Out[123]:
year month profile ranking sales
0 2015 1 A R1 70
6 2015 1 B R2 50
4 2015 2 A R1 30
1 2015 2 B R2 40
5 2015 3 A R3 20
8 2015 3 B R3 10
Bonus answer:
In [131]: srt[srt.groupby(['year','month'])['profile'] \
.rank(method='min') >= 2] \
.groupby(['year','month']).agg({'sales':'sum'})
Out[131]:
sales
year month
2015 1 150
2 130
3 30
With pivoting: you can try to reset index after pivoting:
In [109]: pvt = df.pivot_table(values = 'sales',
.....: index = ['year','month','profile'],
.....: columns = ['ranking'],
.....: aggfunc = 'sum',
.....: fill_value = 0,
.....: margins = True).reset_index()
In [111]: pvt
Out[111]:
ranking year month profile R1 R2 R3 All
0 2015 1 A 70 0 0 70
1 2015 1 B 0 50 0 50
2 2015 1 C 0 0 10 10
3 2015 1 D 0 90 0 90
4 2015 2 A 30 0 0 30
5 2015 2 B 0 40 0 40
6 2015 2 C 90 0 0 90
7 2015 3 A 0 0 20 20
8 2015 3 B 0 0 10 10
9 2015 3 C 0 0 20 20
10 All 190 180 60 430
Now you can use rank() method:
In [110]: pvt[pvt.sort_values(['year','month','profile']).groupby(['year','month'])['profile'].rank(method='min') <= 2]
Out[110]:
ranking year month profile R1 R2 R3 All
0 2015 1 A 70 0 0 70
1 2015 1 B 0 50 0 50
4 2015 2 A 30 0 0 30
5 2015 2 B 0 40 0 40
7 2015 3 A 0 0 20 20
8 2015 3 B 0 0 10 10
10 All 190 180 60 430
Ranking itself:
In [112]: pvt.sort_values(['year','month','profile']).groupby(['year','month'])['profile'].rank(method='min')
Out[112]:
0 1
1 2
2 3
3 4
4 1
5 2
6 3
7 1
8 2
9 3
10 1
dtype: float64
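Note that the rank calls above rank the profile column, i.e. alphabetical order, rather than sales. To pick the top two by sales per year/month (matching the expected D & A for 2015,1), a sketch on the question's raw df could rank the sales column instead; here each (year, month, profile) occurs only once, so no prior aggregation is needed:

rank = df.groupby(['year', 'month'])['sales'].rank(method='first', ascending=False)
top2 = df[rank <= 2]
# everything outside the top two, summed per year/month, as the 'Other' bucket
other = (df[rank > 2]
           .groupby(['year', 'month'])['sales'].sum()
           .rename('Other')
           .reset_index())
print(top2)
print(other)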
