Aggregate functions on a 3-level pandas groupby object - python

I want to make a new df with simple metrics like mean, sum, min, max calculated on the Value column in the df visible below, grouped by ID, Date and Key.
index  ID   Key  Date        Value  x    y    z
0      655  321  2021-01-01  50     546  235  252345
1      675  321  2021-01-01  50     345  345  34545
2      654  356  2021-02-02  70     345  346  543
I am doing it like this:
final = df.groupby(['ID','Date','Key'])['Value'].first().mean(level=[0,1]).reset_index().rename(columns={'Value':'Value_Mean'})
I use .first() because one Key can occur multiple times in the df, but they all have the same Value. I want to aggregate on ID and Date, so I am using level=[0,1].
Then I add the next metrics with a pandas merge:
final = final.merge(df.groupby(['ID','Date','Key'])['Value'].first().max(level=[0,1]).reset_index().rename(columns={'Value':'Value_Max'}), on=['ID','Date'])
And I continue like that for the other metrics. I wonder if there is a more elegant way to do this than repeating it over multiple lines. I know that you can use .agg() and pass a dict of functions, but it seems that you cannot specify the level that way, which is important here.

Use DataFrame.drop_duplicates with named aggregation:
df = pd.DataFrame({'ID':[655,655,655,675,654], 'Key':[321,321,333,321,356],
                   'Date':['2021-01-01','2021-01-01','2021-01-01','2021-01-01','2021-02-02'],
                   'Value':[50,30,10,50,70]})
print (df)
ID Key Date Value
0 655 321 2021-01-01 50
1 655 321 2021-01-01 30
2 655 333 2021-01-01 10
3 675 321 2021-01-01 50
4 654 356 2021-02-02 70
final = (df.drop_duplicates(['ID','Date','Key'])
           .groupby(['ID','Date'], as_index=False)
           .agg(Value_Mean=('Value','mean'),
                Value_Max=('Value','max')))
print (final)
ID Date Value_Mean Value_Max
0 654 2021-02-02 70 70
1 655 2021-01-01 30 50
2 675 2021-01-01 50 50
final = (df.groupby(['ID','Date','Key'], as_index=False)
           .first()
           .groupby(['ID','Date'], as_index=False)
           .agg(Value_Mean=('Value','mean'),
                Value_Max=('Value','max')))
print (final)
ID Date Value_Mean Value_Max
0 654 2021-02-02 70 70
1 655 2021-01-01 30 50
2 675 2021-01-01 50 50
df = (df.groupby(['ID','Date','Key'], as_index=False)
        .first()
        .groupby(['ID','Date'])['Value']
        .agg(['mean', 'max'])
        .add_prefix('Value_')
        .reset_index())
print (df)
ID Date Value_mean Value_max
0 654 2021-02-02 70 70
1 655 2021-01-01 30 50
2 675 2021-01-01 50 50
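The question also asks for sum and min; the same named-aggregation pattern extends to any number of metrics in one .agg() call. A minimal sketch, rebuilding the sample df from above (the column names Value_Sum and Value_Min are just illustrative choices):
import pandas as pd

df = pd.DataFrame({'ID': [655, 655, 655, 675, 654],
                   'Key': [321, 321, 333, 321, 356],
                   'Date': ['2021-01-01', '2021-01-01', '2021-01-01', '2021-01-01', '2021-02-02'],
                   'Value': [50, 30, 10, 50, 70]})

# keep one row per (ID, Date, Key), then aggregate on ID/Date only
final = (df.drop_duplicates(['ID', 'Date', 'Key'])
           .groupby(['ID', 'Date'], as_index=False)
           .agg(Value_Mean=('Value', 'mean'),
                Value_Sum=('Value', 'sum'),
                Value_Min=('Value', 'min'),
                Value_Max=('Value', 'max')))
print (final)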

Related

pandas how to get mean value of datetime timestamp with some conditions?

I have a df; you can reproduce it by copying and running the following code:
import pandas as pd
from io import StringIO
df = """
b_id duration1 duration2 user
366 NaN 38 days 22:05:06.807430 Test
367 0 days 00:00:05.285239 NaN Test
368 NaN NaN Test
371 NaN NaN Test
378 NaN 451 days 14:59:28.830482 Test
384 28 days 21:05:16.141263 0 days 00:00:44.999706 Test
466 NaN 38 days 22:05:06.807430 Tom
467 0 days 00:00:05.285239 NaN Tom
468 NaN NaN Tom
471 NaN NaN Tom
478 NaN 451 days 14:59:28.830482 Tom
484 28 days 21:05:16.141263 0 days 00:00:44.999706 Tom
"""
df = pd.read_csv(StringIO(df.strip()), sep=r'\s\s+', engine='python')
df
My question is, how can I get the mean value of each duration column for each user?
The output should be something like this (the mean values are made up for the example, not the exact means):
mean_duration1 mean_duration2 user
8 days 22:05:06.807430 3 days 22:05:06.807430 Test
2 days 00:00:05.285239 4 days 22:05:06.807430 Tom
You can use:
out = (df
.set_index('user')
.filter(like='duration')
.apply(pd.to_timedelta)
.groupby(level=0).mean()
.reset_index()
)
Output:
user duration1 duration2
0 Test 14 days 10:32:40.713251 163 days 12:21:46.879206
1 Tom 14 days 10:32:40.713251 163 days 12:21:46.879206
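If the columns should match the names asked for (mean_duration1, mean_duration2), a small variation is to add a prefix before resetting the index. A minimal sketch on a hypothetical, smaller frame with the same layout as the question's df:
import pandas as pd
import numpy as np

# hypothetical subset shaped like the question's df
df = pd.DataFrame({'b_id': [366, 367, 466, 467],
                   'duration1': [np.nan, '0 days 00:00:05.285239', np.nan, '0 days 00:00:05.285239'],
                   'duration2': ['38 days 22:05:06.807430', np.nan, '38 days 22:05:06.807430', np.nan],
                   'user': ['Test', 'Test', 'Tom', 'Tom']})

out = (df
       .set_index('user')
       .filter(like='duration')
       .apply(pd.to_timedelta)   # strings/NaN -> Timedelta/NaT
       .groupby(level=0).mean()  # NaT is skipped by default
       .add_prefix('mean_')      # duration1 -> mean_duration1, ...
       .reset_index()
       )
print(out)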

Repeat rows and iterate time python

I need help with this. I want to repeat the previous row, iterating the time one second at a time up to one second before the next row, based on a condition. That is, when the Indicator >= 1, repeat the previous row and iterate the time up to the next timestamp, as shown in the output. Below is my code.
The Input:
b0   b1     time      Indicator
325  350    10:59:40  0
200  42333  10:59:45  1
This is what I was thinking...
Code:
To sort and filter the data
df_new = df #new dataframe
new_index = 0 #to keep track of index in df_new inside the loop
for i, row in df.iterrows():
    new_row = {}
    if (row['indicator'] > 1) and (i != 0):
        for column in df.columns:
            if column == 'time':
                new_row[column] = row[column] + datetime.timedelta(seconds=-1)
            else:
                new_row[column] = prev_row[column]
The problem I'm having is iterating the time. At the moment, it is only doing it for the second before the next timestamp.
The Output:
b0   b1     time      Indicator
325  350    10:59:40  0
325  350    10:59:41  0
325  350    10:59:42  0
325  350    10:59:43  0
325  350    10:59:44  0
200  42333  10:59:45  1
This was achieved with several functions instead of loop processing. First, asfreq() is used to fill in the gaps at one-second intervals. The rows with Indicator == 0 from the resampled frame and the rows with Indicator == 1 from the original frame are then concatenated and sorted into a time series.
import pandas as pd
import numpy as np
import io
data = '''
b0 b1 time Indicator
325 350 10:59:40 0
200 42333 10:59:45 1
424 236 11:00:00 0
525 361 11:00:10 0
623 896 11:00:20 1
'''
df = pd.read_csv(io.StringIO(data), delim_whitespace=True)
df['time'] = pd.to_datetime(df['time'])
df.set_index('time', inplace=True)
dfs = df.asfreq('1s', method='ffill')
dfs = pd.concat([dfs[dfs['Indicator'] == 0], df[df['Indicator'] == 1]], axis=0)
dfs.sort_values(by='time', ascending=True, inplace=True)
dfs
b0 b1 Indicator
time
2021-02-06 10:59:40 325 350 0
2021-02-06 10:59:41 325 350 0
2021-02-06 10:59:42 325 350 0
2021-02-06 10:59:43 325 350 0
2021-02-06 10:59:44 325 350 0
2021-02-06 10:59:45 200 42333 1
2021-02-06 11:00:00 424 236 0
2021-02-06 11:00:01 424 236 0
2021-02-06 11:00:02 424 236 0
2021-02-06 11:00:03 424 236 0
2021-02-06 11:00:04 424 236 0
2021-02-06 11:00:05 424 236 0
2021-02-06 11:00:06 424 236 0
2021-02-06 11:00:07 424 236 0
2021-02-06 11:00:08 424 236 0
2021-02-06 11:00:09 424 236 0
2021-02-06 11:00:10 525 361 0
2021-02-06 11:00:11 525 361 0
2021-02-06 11:00:12 525 361 0
2021-02-06 11:00:13 525 361 0
2021-02-06 11:00:14 525 361 0
2021-02-06 11:00:15 525 361 0
2021-02-06 11:00:16 525 361 0
2021-02-06 11:00:17 525 361 0
2021-02-06 11:00:18 525 361 0
2021-02-06 11:00:19 525 361 0
2021-02-06 11:00:20 623 896 1
Here's another way to do it.
Step 1: Convert the time column to datetime.
df['time'] = pd.to_datetime(df['time'],format='%H:%M:%S')
Step 2: Get the time difference between the next row and the current row, convert NaN to 0, and cast the value to integer.
df['time_diff'] = (df.time.shift(-1) - df.time).dt.seconds.fillna(0).astype(int)
Step 3: Get the next row's Indicator using shift(-1), replacing NaN with 0.
df['next_ind'] = df.Indicator.shift(-1).fillna(0).astype(int)
Step 4: If the current row's Indicator is >= 1, do not repeat the row: set its time_diff to 1.
df.loc[df.Indicator >= 1, 'time_diff'] = 1
Step 5: Similarly, if the current row's Indicator is 0 and the next row's is also 0, do not repeat that row either: set its time_diff to 1.
df.loc[(df.Indicator == 0) & (df.next_ind == 0), 'time_diff'] = 1
Step 6: time_diff now holds either 1 or the number of seconds between a row with Indicator 0 and the next row with Indicator >= 1. Use it as the repeat count to build a list of timestamps so we can explode.
df['time'] = df.apply(lambda x: list(pd.date_range(x['time'], periods=x['time_diff'], freq=pd.DateOffset(seconds=1))),axis=1)
Step 7: Now explode the dataframe on the time column, since it contains lists.
df = df.explode('time')
Step 8: Print the final dataframe for the desired result.
Putting all this together, the code is shown below.
c = ['b0','b1','time','Indicator']
d = [[325,350,'10:59:40',0],
[200,42333,'10:59:45',1],
[300,1234,'10:59:52',0],
[400,2345,'10:59:55',0],
[500,3456,'10:59:58',1],
[600,4567,'11:00:03',2]]
import pandas as pd
df = pd.DataFrame(d,columns=c)
print (df)
df['time'] = pd.to_datetime(df['time'],format='%H:%M:%S')
df['time_diff'] = (df.time.shift(-1) - df.time).dt.seconds.fillna(0).astype(int)
df['next_ind'] = df.Indicator.shift(-1).fillna(0).astype(int)
df.loc[df.Indicator >= 1, 'time_diff'] = 1
df.loc[(df.Indicator == 0) & (df.next_ind == 0), 'time_diff'] = 1
df['time'] = df.apply(lambda x: list(pd.date_range(x['time'], periods=x['time_diff'], freq=pd.DateOffset(seconds=1))),axis=1)
df = df.explode('time')
df.drop(columns=['time_diff','next_ind'],inplace=True)
print (df)
Output of this will be:
Original DataFrame:
b0 b1 time Indicator
0 325 350 10:59:40 0
1 200 42333 10:59:45 1
2 300 1234 10:59:52 0
3 400 2345 10:59:55 0
4 500 3456 10:59:58 1
5 600 4567 11:00:03 2
Updated DataFrame:
b0 b1 time Indicator
0 325 350 1900-01-01 10:59:40 0
0 325 350 1900-01-01 10:59:41 0
0 325 350 1900-01-01 10:59:42 0
0 325 350 1900-01-01 10:59:43 0
0 325 350 1900-01-01 10:59:44 0
1 200 42333 1900-01-01 10:59:45 1
2 300 1234 1900-01-01 10:59:52 0
3 400 2345 1900-01-01 10:59:55 0
3 400 2345 1900-01-01 10:59:56 0
3 400 2345 1900-01-01 10:59:57 0
4 500 3456 1900-01-01 10:59:58 1
5 600 4567 1900-01-01 11:00:03 2

Pandas Collapse and Stack Multi-level columns

I want to break down the multi-level columns and have them as column values.
Original data input (excel):
As read in dataframe:
Company Name Company code 2017-01-01 00:00:00 Unnamed: 3 Unnamed: 4 Unnamed: 5 2017-02-01 00:00:00 Unnamed: 7 Unnamed: 8 Unnamed: 9 2017-03-01 00:00:00 Unnamed: 11 Unnamed: 12 Unnamed: 13
0 NaN NaN Product A Product B Product C Product D Product A Product B Product C Product D Product A Product B Product C Product D
1 Company A #123 1 5 3 5 0 2 3 4 0 1 2 3
2 Company B #124 600 208 30 20 600 213 30 15 600 232 30 12
3 Company C #125 520 112 47 15 520 110 47 10 520 111 47 15
4 Company D #126 420 165 120 31 420 195 120 30 420 182 120 58
Intended data frame:
I have tried stack(), unstack() and also swaplevel(), but I couldn't get the date columns to 'drop' down into rows. It looks like the merged cells in Excel produce NaN in the dataframe, and when the merge spans columns I end up with unnamed columns. How do I work around it? Am I missing something really simple here?
Using stack
df.stack(level=0).reset_index(level=1)
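A minimal sketch of what that does, assuming the Excel sheet is read with header=[0, 1] so the dates and product names form a two-level column index (the tiny frame below is made up for illustration):
import pandas as pd

cols = pd.MultiIndex.from_product(
    [['2017-01-01', '2017-02-01'], ['Product A', 'Product B']])
df = pd.DataFrame([[1, 5, 0, 2],
                   [600, 208, 600, 213]],
                  index=pd.Index(['Company A', 'Company B'], name='Company Name'),
                  columns=cols)

# move the first column level (the dates) into the rows,
# then turn that index level back into a regular column
out = df.stack(level=0).reset_index(level=1).rename(columns={'level_1': 'Date'})
print(out)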

pandas groupby one column and then groupby another column

I have a df,
code id amount
BB10 531 20
BB10 531 30
BB10 532 50
BR11 631 10
BR11 632 5
IN20 781 10
IN20 781 20
IN20 781 30
I want to first group the df by code and get the total amount within each group,
df.groupby('code')['amount'].agg('sum')
then I would like to know the percentage of the amount for a specific id within a specific code group, e.g. for 531 the amount is 50 within BB10, an amount percentage of 50%; the result df should look like,
code id amount pct
BB10 531 50 50%
BB10 532 50 50%
BR11 631 10 66.7%
BR11 632 5 33.3%
IN20 781 60 100%
First aggregate the sum by both columns, then divide amount by the total per code, multiply by 100 and round:
df1 = df.groupby(['code','id'], as_index=False)['amount'].sum()
df1['pct']=df1['amount'].div(df1.groupby('code')['amount'].transform('sum')).mul(100).round(1)
print (df1)
code id amount pct
0 BB10 531 50 50.0
1 BB10 532 50 50.0
2 BR11 631 10 66.7
3 BR11 632 5 33.3
4 IN20 781 60 100.0
Last, if you need percentage strings, convert the values to strings and add %:
df1['pct'] = df1['pct'].astype(str) + '%'
print (df1)
code id amount pct
0 BB10 531 50 50.0%
1 BB10 532 50 50.0%
2 BR11 631 10 66.7%
3 BR11 632 5 33.3%
4 IN20 781 60 100.0%
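If the output should drop the trailing .0 exactly as in the expected result (50% rather than 50.0%), one option is to format the rounded numbers with '{:g}'. A minimal sketch, rebuilding the sample df:
import pandas as pd

df = pd.DataFrame({'code': ['BB10','BB10','BB10','BR11','BR11','IN20','IN20','IN20'],
                   'id': [531, 531, 532, 631, 632, 781, 781, 781],
                   'amount': [20, 30, 50, 10, 5, 10, 20, 30]})

df1 = df.groupby(['code','id'], as_index=False)['amount'].sum()
# round to one decimal, then '{:g}' drops a trailing '.0' (50.0 -> '50%')
df1['pct'] = (df1['amount']
              .div(df1.groupby('code')['amount'].transform('sum'))
              .mul(100).round(1)
              .map('{:g}%'.format))
print (df1)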

How do I reshape this DataFrame in Python?

I have a DataFrame df_sale in Python that I want to reshape, computing the sum of the price column and adding it as a new column total. Below is the df_sale:
b_no a_id price c_id
120 24 50 2
120 56 100 2
120 90 25 2
120 45 20 2
231 89 55 3
231 45 20 3
231 10 250 3
Expected output after reshaping:
b_no a_id_1 a_id_2 a_id_3 a_id_4 total c_id
120 24 56 90 45 195 2
231 89 45 10 0 325 3
What I have tried so far is to use sum() on df_sale['price'] separately for 120 and 231. I do not understand how I should reshape the data, add the new column headers, and get the total without being computationally inefficient. Thanks.
This might not be the cleanest method (at all), but it gets the outcome you want:
reshaped_df = (df.groupby('b_no')[['price', 'c_id']]
               .first()
               .join(df.groupby('b_no')['a_id']
                       .apply(list)
                       .apply(pd.Series)
                       .add_prefix('a_id_'))
               .drop(columns='price')
               .join(df.groupby('b_no')['price'].sum().to_frame('total'))
               .fillna(0))
>>> reshaped_df
c_id a_id_0 a_id_1 a_id_2 a_id_3 total
b_no
120 2 24.0 56.0 90.0 45.0 195
231 3 89.0 45.0 10.0 0.0 325
You can achieve this by grouping by b_no and c_id, summing price into total, and flattening a_id:
import pandas as pd
d = {"b_no": [120,120,120,120,231,231, 231],
"a_id": [24,56,90,45,89,45,10],
"price": [50,100,25,20,55,20,250],
"c_id": [2,2,2,2,3,3,3]}
df = pd.DataFrame(data=d)
df2 = df.groupby(['b_no', 'c_id'])['a_id'].apply(list).apply(pd.Series).add_prefix('a_id_').fillna(0)
df2["total"] = df.groupby(['b_no', 'c_id'])['price'].sum()
print(df2)
a_id_0 a_id_1 a_id_2 a_id_3 total
b_no c_id
120 2 24.0 56.0 90.0 45.0 195
231 3 89.0 45.0 10.0 0.0 325
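If the layout should match the expected output more closely (columns a_id_1 through a_id_4, with b_no, c_id and total as ordinary columns), a possible tweak of the snippet above, using the same sample data:
import pandas as pd

d = {"b_no": [120,120,120,120,231,231,231],
     "a_id": [24,56,90,45,89,45,10],
     "price": [50,100,25,20,55,20,250],
     "c_id": [2,2,2,2,3,3,3]}
df = pd.DataFrame(data=d)

# spread each group's a_id values into columns numbered from 1
df2 = (df.groupby(['b_no', 'c_id'])['a_id']
         .apply(list)
         .apply(pd.Series)
         .fillna(0)
         .rename(columns=lambda i: f'a_id_{i + 1}'))
df2['total'] = df.groupby(['b_no', 'c_id'])['price'].sum()
df2 = df2.reset_index()
print(df2)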
