How to write to the cell below level 0 of a multiindex - python

I've got a df with a MultiIndex like so
nums = np.arange(5)
key = ['kfc'] * 5
mi = pd.MultiIndex.from_arrays([key,nums])
df = pd.DataFrame({'rent': np.arange(10,60,10)})
df = df.set_index(mi)
       rent
kfc 0    10
    1    20
    2    30
    3    40
    4    50
How can I write to the cell below kfc? I want to add meta info, e.g. the address or the monthly rent:
       rent
kfc 0    10
NYC 1    20
    2    30
    3    40
    4    50

According to your expected output you would need to recreate the df MultiIndex:
df.index = pd.MultiIndex.from_tuples(zip(['kfc'] + ['NYC'] * 4, df.index.levels[1]))
print(df)
       rent
kfc 0    10
NYC 1    20
    2    30
    3    40
    4    50
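If you are rebuilding the frame anyway, you can also construct the level-0 labels directly; a minimal sketch of the same idea (the 'NYC' label is just the example meta info from the question):

import numpy as np
import pandas as pd

nums = np.arange(5)
key = ['kfc'] + ['NYC'] * 4   # level-0 labels matching the expected output
df = pd.DataFrame({'rent': np.arange(10, 60, 10)},
                  index=pd.MultiIndex.from_arrays([key, nums]))
print(df)

Note that either way 'NYC' simply becomes part of the row labels; it is not stored as a separate metadata field.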

How to merge dataframe between dates

I have one dataframe that contains daily sales data (DF).
I have another dataframe that contains quarterly data (DF1).
This is what the quarterly dataframe, DF1, looks like:
Date Computer Sale In Person Sales Net Sales
1/29/2021 1 2 3
4/30/2021 2 4 6
7/29/2021 3 6 9
1/29/2022 4 8 12
5/1/2022 5 10 15
7/30/2022 6 12 18
This is what the daily dataframe (DF) looks like:
Date       Num of people
1/30/2021  45
1/31/2021  35
2/1/2021   25
5/1/2021   20
5/2/2021   15
I have columns Computer Sales, In Person Sales, Net Sales in the quarterly dataframe.
How do I merge the columns from above into the daily dataframe so that I can see the quarterly data on the daily dataframe? I want the final result to look like this:
Date Num of people Computer Sale In Person Sales Net Sales
1/30/2021 45 1 2 3
1/31/2021 35 1 2 3
2/1/2021 25 1 2 3
5/1/2021 20 2 4 6
5/2/2021 15 2 4 6
So, for example, I want 1/30/2021 to use the figures from 1/29/2021, and once the daily data goes past 4/30/2021, merge in the next quarter's data.
Please let me know if I need to be more specific.
A possible solution:
# df1 is the quarterly frame (DF1), df2 is the daily frame (DF); both must be sorted by Date
df1['Date'] = pd.to_datetime(df1['Date'])
df2['Date'] = pd.to_datetime(df2['Date'])
# for each daily date, take the most recent quarterly row at or before it
pd.merge_asof(df2, df1, on='Date', direction='backward')
Output:
Date Num of people Computer Sale In Person Sales Net Sales
0 2021-01-30 45 1 2 3
1 2021-01-31 35 1 2 3
2 2021-02-01 25 1 2 3
3 2021-05-01 20 2 4 6
4 2021-05-02 15 2 4 6
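For reference, a self-contained sketch of the same approach, using a subset of the question's rows (df1 is the quarterly frame DF1 and df2 the daily frame DF, as in the snippet above):

import pandas as pd

df1 = pd.DataFrame({'Date': ['1/29/2021', '4/30/2021'],
                    'Computer Sale': [1, 2],
                    'In Person Sales': [2, 4],
                    'Net Sales': [3, 6]})
df2 = pd.DataFrame({'Date': ['1/30/2021', '1/31/2021', '5/1/2021'],
                    'Num of people': [45, 35, 20]})

df1['Date'] = pd.to_datetime(df1['Date'])
df2['Date'] = pd.to_datetime(df2['Date'])

# merge_asof needs both frames sorted on the key; direction='backward'
# attaches the most recent quarterly row at or before each daily date
out = pd.merge_asof(df2.sort_values('Date'), df1.sort_values('Date'),
                    on='Date', direction='backward')
print(out)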

Calculate the amount spent per month depending on another column value, TYPE_ID

I am trying to get the amount spent for each TYPE_ID, broken down by the month column.
Dataset:
ID TYPE_ID Month_year Amount
100 1 jun_2019 20
100 1 jul_2019 30
100 2 jun_2019 10
200 1 jun_2019 50
200 1 jun_2019 30
100 2 jul_2019 20
200 2 jun_2019 40
200 2 jul_2019 10
200 2 jun_2019 20
200 1 jul_2019 30
100 1 jul_2019 10
Output:
For every TYPE_ID, I want to calculate the spend per month. The column TYPEID_1_jun2019 tells me the number of transactions done in that particular month, and Amount_type1_jun2019 tells me the total amount spent in that month for that TYPE_ID.
ID TYPEID_1_jun2019 Amount_type1_jun2019 TYPEID_1_jul2019 Amount_type1_jul2019 TYPEID_2_jun2019 Amount_type2_jun2019 TYPEID_2_jul2019 Amount_type2_jul2019
100 1 20 2 40 1 10 1 20
200 1 80 1 30 2 60 1 10
EDIT: I also want to calculate the average monthly spend for every ID.
Output: also include these columns,
ID Average_type1_jul2019 Average_type1_jun2019
100 20 10
The formula I used to calculate the average is the amount spent in July with TYPE_ID 1 divided by the total number of months.
First convert Month_year to datetimes for correct ordering, then create a helper column type and aggregate sum together with size, reshape with DataFrame.unstack, sort with DataFrame.sort_index, and finally flatten the MultiIndex, converting the datetimes back to the original format:
df['Month_year'] = pd.to_datetime(df['Month_year'], format='%b_%Y')
df1 = (df.assign(type=df['TYPE_ID']).groupby(['ID','Month_year','TYPE_ID'])
.agg({'Amount':'sum', 'type':'size'})
.unstack([1,2])
.sort_index(axis=1, level=[1,2]))
df1.columns = df1.columns.map(lambda x: f'{x[0]}_{x[2]}_{x[1].strftime("%b_%Y")}')
df1 = df1.reset_index()
print (df1)
ID Amount_1_Jun_2019 type_1_Jun_2019 Amount_2_Jun_2019 \
0 100 20 1 10
1 200 80 2 60
type_2_Jun_2019 Amount_1_Jul_2019 type_1_Jul_2019 Amount_2_Jul_2019 \
0 1 40 2 20
1 2 30 1 10
type_2_Jul_2019
0 1
1 1
EDIT:
# removed sorting and flattening of the MultiIndex
df['Month_year'] = pd.to_datetime(df['Month_year'], format='%b_%Y')
df1 = (df.assign(type=df['TYPE_ID']).groupby(['ID','Month_year','TYPE_ID'])
.agg({'Amount':'sum', 'type':'size'})
.unstack([1,2]))
print (df1)
Amount type
Month_year 2019-06-01 2019-07-01 2019-06-01 2019-07-01
TYPE_ID 1 2 1 2 1 2 1 2
ID
100 20 10 40 20 1 1 2 1
200 80 60 30 10 2 2 1 1
# get the number of unique Month_year values per ID and type, and divide the Amount sums by it
df2 = df.groupby(['ID','TYPE_ID'])['Month_year'].nunique().unstack()
df3 = df1.xs('Amount', axis=1, level=0).div(df2, level=1)
#added top level Average
df3.columns = pd.MultiIndex.from_tuples([('Average', a, b) for a, b in df3.columns])
print (df3)
Average
2019-06-01 2019-07-01
1 2 1 2
ID
100 10.0 5.0 20.0 10.0
200 40.0 30.0 15.0 5.0
# join together, sort and flatten the MultiIndex
df5 = pd.concat([df1, df3],axis=1).sort_index(axis=1, level=[1,2])
df5.columns = df5.columns.map(lambda x: f'{x[0]}_{x[2]}_{x[1].strftime("%b_%Y")}')
df5 = df5.reset_index()
print (df5)
ID Amount_1_Jun_2019 Average_1_Jun_2019 type_1_Jun_2019 \
0 100 20 10.0 1
1 200 80 40.0 2
Amount_2_Jun_2019 Average_2_Jun_2019 type_2_Jun_2019 Amount_1_Jul_2019 \
0 10 5.0 1 40
1 60 30.0 2 30
Average_1_Jul_2019 type_1_Jul_2019 Amount_2_Jul_2019 Average_2_Jul_2019 \
0 20.0 2 20 10.0
1 15.0 1 10 5.0
type_2_Jul_2019
0 1
1 1
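As a rough alternative along the same lines, pivot_table can produce the per-month sums and transaction counts in one call; a sketch on a few of the question's rows (the flattened column names here are illustrative, not exactly the ones the OP asked for):

import pandas as pd

df = pd.DataFrame({'ID': [100, 100, 100, 200, 200],
                   'TYPE_ID': [1, 1, 2, 1, 1],
                   'Month_year': ['jun_2019', 'jul_2019', 'jun_2019', 'jun_2019', 'jun_2019'],
                   'Amount': [20, 30, 10, 50, 30]})

out = df.pivot_table(index='ID', columns=['TYPE_ID', 'Month_year'],
                     values='Amount', aggfunc=['sum', 'count'], fill_value=0)
# columns come out as (aggfunc, TYPE_ID, Month_year); flatten them
out.columns = [f'{agg}_type{t}_{m}' for agg, t, m in out.columns]
print(out.reset_index())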

Pandas calculate aggregate value with respect to the current row

Let's say we have this data:
df = pd.DataFrame({
'group_id': [100,100,100,101,101,101,101],
'amount': [30,40,10,20,25,80,40]
})
df.index.name = 'id'
df.set_index(['group_id', df.index], inplace=True)
It looks like this:
             amount
group_id id
100      0       30
         1       40
         2       10
101      3       20
         4       25
         5       80
         6       40
The goal is to compute a new column that is the sum of all amounts in the same group that are less than the current one, i.e. we want this result:
             amount  sum_of_smaller_amounts
group_id id
100      0       30                      10
         1       40                      40   # 30 + 10
         2       10                       0   # smallest amount
101      3       20                       0   # smallest
         4       25                      20
         5       80                      85   # 20 + 25 + 40
         6       40                      45   # 20 + 25
Ideally this should be (very) efficient as the real dataframe could be millions of rows.
Better solution (I think):
df_sort = df.sort_values('amount')
df['sum_smaller_amount'] = (df_sort.groupby('group_id')['amount']
                                   .transform(lambda x: x.mask(x.duplicated(), 0).cumsum()) -
                            df['amount'])
Output:
amount sum_smaller_amount
group_id id
100 0 30 10.0
1 40 40.0
2 10 0.0
101 3 20 0.0
4 25 20.0
5 80 85.0
6 40 45.0
Another way to do this is to use a cartesian product and filter:
df.merge(df.reset_index(), on='group_id', suffixes=('_sum_smaller',''))\
.query('amount_sum_smaller < amount')\
.groupby(['group_id','id'])[['amount_sum_smaller']].sum()\
.join(df, how='right').fillna(0)
Output:
amount_sum_smaller amount
group_id id
100 0 10.0 30
1 40.0 40
2 0.0 10
101 3 0.0 20
4 20.0 25
5 85.0 80
6 45.0 40
You want sort_values and cumsum:
df['new_amount'] = (df.sort_values('amount')
                      .groupby(level='group_id')['amount']
                      .cumsum() - df['amount'])
Output:
amount new_amount
group_id id
100 0 30 10
1 40 40
2 10 0
101 3 20 0
4 25 20
5 80 85
6 40 45
Update: fix for repeated values:
# the data
df = pd.DataFrame({
'group_id': [100,100,100,100,101,101,101,101],
'amount': [30,40,10,30,20,25,80,40]
})
df.index.name = 'id'
df.set_index(['group_id', df.index], inplace=True)
# sort values:
df_sorted = df.sort_values('amount')
# cumsum
s1 = df_sorted.groupby('group_id')['amount'].cumsum()
# value counts
s2 = df_sorted.groupby(['group_id', 'amount']).cumcount() + 1
# instead of just subtracting df['amount'], we subtract amount * counts
df['new_amount'] = s1 - df['amount'].mul(s2)
Output (note the two values of 30 in group 100):
amount new_amount
group_id id
100 0 30 10
1 40 70
2 10 0
3 30 10
101 4 20 0
5 25 20
6 80 85
7 40 45
I'm intermediate on pandas and not sure about efficiency, but here's a solution:
temp_df = df.sort_values(['group_id','amount'])
temp_df = temp_df.mask(temp_df['amount'] == temp_df['amount'].shift(), other=0).groupby(level='group_id').cumsum()
df['sum'] = temp_df.sort_index(level='id')['amount'] - df['amount']
Result:
amount sum
group_id id
100 0 30 10
1 40 40
2 10 0
101 3 20 0
4 25 20
5 80 85
6 40 45
7 40 45
You can substitute the last line with these if they help efficiency somehow:
df['sum'] = df.subtract(temp_df).multiply(-1)
# or
df['sum'] = (~df).add(temp_df + 1)
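For reference, a brute-force sketch that states the definition directly; it is useful for checking the vectorized answers on small frames, but far too slow for millions of rows:

import pandas as pd

df = pd.DataFrame({
    'group_id': [100, 100, 100, 101, 101, 101, 101],
    'amount':   [30, 40, 10, 20, 25, 80, 40]
})
df.index.name = 'id'
df = df.set_index(['group_id', df.index])

# for each row, sum the amounts in its group that are strictly smaller
df['sum_smaller_check'] = df.groupby(level='group_id')['amount'].transform(
    lambda s: s.map(lambda v: s[s < v].sum()))
print(df)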

pandas get average of a groupby

I am trying to find the average monthly cost per user_id, but I am only able to get the average cost per user or the monthly cost per user.
Because I group by user and month, there is no way to get the average of the second groupby (month) unless I transform the groupby output to something else.
This is my df:
df = pd.DataFrame({'id':   pd.Series([1,1,1,1,2,2,2,2]),
                   'cost': pd.Series([10,20,30,40,50,60,70,80]),
                   'mth':  pd.Series([3,3,4,5,3,4,4,5])})
cost id mth
0 10 1 3
1 20 1 3
2 30 1 4
3 40 1 5
4 50 2 3
5 60 2 4
6 70 2 4
7 80 2 5
I can get the monthly sum, but I want the average over the months for each user_id.
df.groupby(['id','mth'])['cost'].sum()
id  mth
1   3       30
    4       30
    5       40
2   3       50
    4      130
    5       80
I want something like this:
id average_monthly
1 (30+30+40)/3
2 (50+130+80)/3
Resetting the index should work. Try this:
In [19]: df.groupby(['id', 'mth']).sum().reset_index().groupby('id').mean()
Out[19]:
mth cost
id
1 4.0 33.333333
2 4.0 86.666667
You can just drop mth if you want. The logic is that after the sum part, you have this:
In [20]: df.groupby(['id', 'mth']).sum()
Out[20]:
         cost
id mth
1  3       30
   4       30
   5       40
2  3       50
   4      130
   5       80
Resetting the index at this point gives you one row per unique (id, mth) pair.
In [21]: df.groupby(['id', 'mth']).sum().reset_index()
Out[21]:
id mth cost
0 1 3 30
1 1 4 30
2 1 5 40
3 2 3 50
4 2 4 130
5 2 5 80
It's just a matter of grouping it again, this time using mean instead of sum. This should give you the averages.
Let us know if this helps.
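Equivalently, because the summed result keeps id as an index level, you can group the Series again without resetting the index; a compact sketch of the same sum-then-mean idea:

import pandas as pd

df = pd.DataFrame({'id':   [1, 1, 1, 1, 2, 2, 2, 2],
                   'cost': [10, 20, 30, 40, 50, 60, 70, 80],
                   'mth':  [3, 3, 4, 5, 3, 4, 4, 5]})

# sum per (id, mth), then average those monthly sums per id
avg = (df.groupby(['id', 'mth'])['cost'].sum()
         .groupby('id').mean()
         .rename('average_monthly')
         .reset_index())
print(avg)
#    id  average_monthly
# 0   1        33.333333
# 1   2        86.666667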
The same sum-then-mean pattern applied to differently named columns (note that the second groupby should be on the key you want the averages for, e.g. the year, not on the summed Revenue column):
df_monthly_average = (
    df.groupby(["InvoiceMonth", "InvoiceYear"])["Revenue"]
    .sum()
    .reset_index()
    .groupby("InvoiceYear")
    .mean()
    .reset_index()
)

How to get top 2 per multi index in Pandas dataframe (generated by pivot_table)

I would like to show the top 2 results for the first 2 levels of a 3-level indexed dataframe (created with pivot_table).
import pandas as pd
df = pd.DataFrame([[2015,1,'A','R1',70],
[2015,2,'B','R2',40],
[2015,3,'C','R3',20],
[2015,1,'D','R2',90],
[2015,2,'A','R1',30],
[2015,3,'A','R3',20],
[2015,1,'B','R2',50],
[2015,2,'C','R1',90],
[2015,3,'B','R3',10],
[2015,1,'C','R3',10]],
columns = ['year','month','profile','ranking','sales'])
# create a pivot that sums the sales, sorts the profiles by total sales per year, month and profile
df.pivot_table(values = 'sales',
index = ['year','month','profile'],
columns = ['ranking'],
aggfunc = 'sum',
fill_value = 0,
margins = True).sort_values(by = 'All',ascending = False).sort_index(level=[0,1], sort_remaining=False)
Question 1: how do I get only the top two profiles per year/month combination?
so
for: 2015,1: D & A
for: 2015,2: C & B
for: 2015,3: A & C
Bonus question:
How to get the sums for the non-top-2 profiles and call them 'Other'?
so
for: 2015,1: Other,0,50,10,60 (which is the sum of B&C)
for: 2015,2: Other,30,0,0,30 (which is A only in this case)
for: 2015,3: Other,0,0,10,10 (which is B only in this case)
I would like to have it returned to me as a dataframe.
UPDATE:
without pivoting:
In [120]: srt = df.sort_values(['year','month','profile'])
In [123]: srt[srt.groupby(['year','month'])['profile'].rank(method='min') <= 2]
Out[123]:
year month profile ranking sales
0 2015 1 A R1 70
6 2015 1 B R2 50
4 2015 2 A R1 30
1 2015 2 B R2 40
5 2015 3 A R3 20
8 2015 3 B R3 10
Bonus answer:
In [131]: srt[srt.groupby(['year','month'])['profile'] \
.rank(method='min') >= 2] \
.groupby(['year','month']).agg({'sales':'sum'})
Out[131]:
sales
year month
2015 1 150
2 130
3 30
With pivoting: you can try to reset index after pivoting:
In [109]: pvt = df.pivot_table(values = 'sales',
.....: index = ['year','month','profile'],
.....: columns = ['ranking'],
.....: aggfunc = 'sum',
.....: fill_value = 0,
.....: margins = True).reset_index()
In [111]: pvt
Out[111]:
ranking year month profile R1 R2 R3 All
0 2015 1 A 70 0 0 70
1 2015 1 B 0 50 0 50
2 2015 1 C 0 0 10 10
3 2015 1 D 0 90 0 90
4 2015 2 A 30 0 0 30
5 2015 2 B 0 40 0 40
6 2015 2 C 90 0 0 90
7 2015 3 A 0 0 20 20
8 2015 3 B 0 0 10 10
9 2015 3 C 0 0 20 20
10 All 190 180 60 430
Now you can use rank() method:
In [110]: pvt[pvt.sort_values(['year','month','profile']).groupby(['year','month'])['profile'].rank(method='min') <= 2]
Out[110]:
ranking year month profile R1 R2 R3 All
0 2015 1 A 70 0 0 70
1 2015 1 B 0 50 0 50
4 2015 2 A 30 0 0 30
5 2015 2 B 0 40 0 40
7 2015 3 A 0 0 20 20
8 2015 3 B 0 0 10 10
10 All 190 180 60 430
Ranking itself:
In [112]: pvt.sort_values(['year','month','profile']).groupby(['year','month'])['profile'].rank(method='min')
Out[112]:
0 1
1 2
2 3
3 4
4 1
5 2
6 3
7 1
8 2
9 3
10 1
dtype: float64
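Note that the rank calls above rank the profile column, i.e. alphabetical order, rather than sales. To pick the top two by sales per year/month (matching the expected D & A for 2015,1), a sketch on the question's raw df could rank the sales column instead; here each (year, month, profile) occurs only once, so no prior aggregation is needed:

rank = df.groupby(['year', 'month'])['sales'].rank(method='first', ascending=False)
top2 = df[rank <= 2]
# everything outside the top two, summed per year/month, as the 'Other' bucket
other = (df[rank > 2]
           .groupby(['year', 'month'])['sales'].sum()
           .rename('Other')
           .reset_index())
print(top2)
print(other)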
