I need some help combining groupby and expanding mean in Python pandas.
I am trying to use pandas' expanding mean together with groupby. In the table below, I want to group by the id column and take an expanding mean ordered by date. The catch is that for January I am not using an expanding mean: think of January as a past month, for which I take the overall mean of the value column per id.
For February and March I want an expanding mean of the value column built on top of January. For example, for 7-Feb and id 1, the 44.75 in the expanding-mean column is the mean of everything before today's value of 89 occurs, i.e. the January values. The next value for id 1, on 7-Mar, then also includes the previous value of 89 from 7-Feb.
So my idea is: take the overall mean up to 1-Feb, and then run an expanding mean on top of whatever mean has been calculated up to that date.
id  date    value  count(prior)  expanding mean (after Feb)
1   1-Jan   28     4             44.75
2   1-Jan   43     3             37.33
3   1-Jan   69     3             57.00
1   2-Jan   31     4             44.75
2   2-Jan   22     3             37.33
1   7-Jan   82     4             44.75
2   7-Jan   47     3             37.33
3   7-Jan   79     3             57.00
1   8-Jan   38     4             44.75
3   8-Jan   23     3             57.00
1   7-Feb   89     4             44.75
2   7-Feb   22     3             37.33
3   7-Feb   80     3             57.00
2   19-Feb  91     4             33.50
3   19-Feb  97     4             62.75
1   7-Mar   48     5             53.60
2   7-Mar   98     5             45.00
3   7-Mar   35     5             69.60
I've included the count column as a reference for how the count increases; for each row it counts everything prior to that date for the given id.
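One way to get this (a minimal sketch; the reconstructed frame, the column names and the year are assumptions based on your table): compute the per-id expanding mean of all prior values with a shift, then overwrite the January rows with the full-January mean per id.

import pandas as pd

# hypothetical reconstruction of the example data (year assumed to be 2020)
df = pd.DataFrame({
    'id':    [1, 2, 3, 1, 2, 1, 2, 3, 1, 3, 1, 2, 3, 2, 3, 1, 2, 3],
    'date':  ['1-Jan', '1-Jan', '1-Jan', '2-Jan', '2-Jan', '7-Jan', '7-Jan', '7-Jan',
              '8-Jan', '8-Jan', '7-Feb', '7-Feb', '7-Feb', '19-Feb', '19-Feb',
              '7-Mar', '7-Mar', '7-Mar'],
    'value': [28, 43, 69, 31, 22, 82, 47, 79, 38, 23, 89, 22, 80, 91, 97, 48, 98, 35],
})
df['date'] = pd.to_datetime(df['date'] + '-2020', format='%d-%b-%Y')
df = df.sort_values(['date', 'id'])

# expanding mean of all *prior* values per id (the shift excludes the current row)
prior_mean = df.groupby('id')['value'].transform(
    lambda s: s.expanding().mean().shift())

# overall January mean per id, used as the baseline for every January row
jan = df['date'].dt.month == 1
jan_mean = df[jan].groupby('id')['value'].mean()

# January rows get the full-January mean; later rows get the prior expanding mean
df['expanding_mean'] = prior_mean.where(~jan, df['id'].map(jan_mean))

For 7-Feb this reproduces 44.75 for id 1 (the mean of the four January values), and for 7-Mar it gives 53.60 (the mean of those four plus the 89 from 7-Feb), matching the table above.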
I've got a dataframe with two columns: one is a datetime column of dates, and the other holds quantities. It looks something like this:
Date Quantity
0 2019-01-05 10
1 2019-01-10 15
2 2019-01-22 14
3 2019-02-03 12
4 2019-05-11 25
5 2019-05-21 4
6 2019-07-08 1
7 2019-07-30 15
8 2019-09-05 31
9 2019-09-10 44
10 2019-09-25 8
11 2019-12-09 10
12 2020-04-11 111
13 2020-04-17 5
14 2020-06-05 17
15 2020-06-16 12
16 2020-06-22 14
I want to make another dataframe. It should consist of two columns: one is Month/Year and the other is Till Highest. I basically want to calculate the highest quantity value up to and including that month, grouped by month/year. Precisely, what I want is:
Month/Year Till Highest
0 Jan/2019 15
1 Feb/2019 15
2 May/2019 25
3 Jul/2019 25
4 Sep/2019 44
5 Dec/2019 44
6 Apr/2020 111
7 Jun/2020 111
In my case the dataset is vast: I have readings for almost every day of each month and each year in the specified timeline. I've made a dummy dataset here to show an example of what I want.
Please help me with this. Thanks in advance :)
See the annotated code:
(df
 # convert date to monthly period (2019-01)
 .assign(Date=pd.to_datetime(df['Date']).dt.to_period('M'))
 # keep the period and the max quantity per month
 .groupby('Date')
 .agg(**{'Month/Year': ('Date', 'first'),
         'Till highest': ('Quantity', 'max')})
 # format periods as Jan/2019 and take the cumulative max quantity
 .assign(**{'Month/Year': lambda d: d['Month/Year'].dt.strftime('%b/%Y'),
            'Till highest': lambda d: d['Till highest'].cummax()})
 # drop the groupby index
 .reset_index(drop=True)
)
output:
Month/Year Till highest
0 Jan/2019 15
1 Feb/2019 15
2 May/2019 25
3 Jul/2019 25
4 Sep/2019 44
5 Dec/2019 44
6 Apr/2020 111
7 Jun/2020 111
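For comparison, the same result without method chaining might look like this (a sketch):

# group by monthly period, take the max per month, then the running max
month = pd.to_datetime(df['Date']).dt.to_period('M')
out = df.groupby(month)['Quantity'].max().cummax().reset_index()
out['Date'] = out['Date'].dt.strftime('%b/%Y')
out.columns = ['Month/Year', 'Till highest']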
In R you can use cummax:
df = data.frame(
  Date = c("2019-01-05","2019-01-10","2019-01-22","2019-02-03","2019-05-11",
           "2019-05-21","2019-07-08","2019-07-30","2019-09-05","2019-09-10",
           "2019-09-25","2019-12-09","2020-04-11","2020-04-17","2020-06-05",
           "2020-06-16","2020-06-22"),
  Quantity = c(10,15,14,12,25,4,1,15,31,44,8,10,111,5,17,12,14)
)
data.frame(`Month/Year` = unique(format(as.Date(df$Date), "%b/%Y")),
           `Till Highest` = cummax(tapply(df$Quantity, sub("-..$", "", df$Date), max)),
           check.names = FALSE, row.names = NULL)
Month/Year Till Highest
1 Jan/2019 15
2 Feb/2019 15
3 May/2019 25
4 Jul/2019 25
5 Sep/2019 44
6 Dec/2019 44
7 Apr/2020 111
8 Jun/2020 111
I have a pivot table. Columns represent years, rows represent months. I want to create a table containing the percent change between every value and its counterpart for the previous month.
I have managed to create a pivot table with the percentage changes, but, naturally, the data for January is missing.
Instead, I would like to compare each January with the previous December, i.e. the last row of the previous column.
Thank you in advance.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=(12, 3)),
                  columns=['2016', '2017', '2018'])
df.index.name = 'month'
df.index = df.index + 1
print(df)  # the values below come from one particular random draw
2016 2017 2018
month
1 49 98 7
2 72 60 67
3 64 71 53
4 71 75 91
5 68 96 48
6 35 21 54
7 14 98 3
8 62 38 64
9 68 92 58
10 64 95 94
11 54 81 8
12 86 18 90
My current solution:
df_month_pctchange = df.pct_change(axis=0).mul(100)
print(df_month_pctchange)
2016 2017 2018
month
1 NaN NaN NaN
2 46.94 -38.78 857.14
3 -11.11 18.33 -20.90
4 10.94 5.63 71.70
5 -4.23 28.00 -47.25
6 -48.53 -78.12 12.50
7 -60.00 366.67 -94.44
8 342.86 -61.22 2033.33
9 9.68 142.11 -9.38
10 -5.88 3.26 62.07
11 -15.62 -14.74 -91.49
12 59.26 -77.78 1025.00
Desired result:
2016 2017 2018
month
1 NaN 7.35 -61.11
2 46.94 -38.78 857.14
3 -11.11 18.33 -20.90
4 10.94 5.63 71.70
5 -4.23 28.00 -47.25
6 -48.53 -78.12 12.50
7 -60.00 366.67 -94.44
8 342.86 -61.22 2033.33
9 9.68 142.11 -9.38
10 -5.88 3.26 62.07
11 -15.62 -14.74 -91.49
12 59.26 -77.78 1025.00
You can select the first and last rows of df with iloc, use shift on the last row so that the December value of 2016 lines up with the 2017 column (and so on), do the calculation, and then set the result as the first row of df_month_pctchange:
# your code
df_month_pctchange = df.pct_change(axis=0).mul(100)
# what to add to fill the missing values
df_month_pctchange.iloc[0] = (df.iloc[0]/df.iloc[-1].shift()-1)*100
print(df_month_pctchange)
# 2016 2017 2018
# month
# 1 NaN 13.953488 -61.111111 # note it is 13.95 and not 7.35 in 2017
# 2 46.938776 -38.775510 857.142857
# 3 -11.111111 18.333333 -20.895522
# 4 10.937500 5.633803 71.698113
# 5 -4.225352 28.000000 -47.252747
# 6 -48.529412 -78.125000 12.500000
# 7 -60.000000 366.666667 -94.444444
# 8 342.857143 -61.224490 2033.333333
# 9 9.677419 142.105263 -9.375000
# 10 -5.882353 3.260870 62.068966
# 11 -15.625000 -14.736842 -91.489362
# 12 59.259259 -77.777778 1025.000000
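An alternative way to get the same fill (a sketch): flatten the table column by column so that each January directly follows the previous December, take pct_change on the flat series, and reshape back.

import pandas as pd

# column-major flatten: 2016 Jan..Dec, then 2017 Jan..Dec, then 2018
flat = pd.Series(df.to_numpy().ravel(order='F'))
pct = flat.pct_change().mul(100)
alt = pd.DataFrame(pct.to_numpy().reshape(df.shape, order='F'),
                   index=df.index, columns=df.columns)

Only the very first cell (January 2016) stays NaN, since it has no predecessor.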
I can calculate the average in a for loop, but that doesn't seem like an efficient solution. Consider the following DataFrame:
Index Numbers
1 12
2 19
3 47
4 78
5 32
6 63
7 89
Starting from the 4th value, I want to calculate the average of the current value and the three values before it, and store it in an adjacent column. So the expected output is:
Index Numbers Average
1 12
2 19
3 47
4 78 39
5 32 44
6 63 55
7 89 65.5
So the average of the first four numbers, i.e. Index 1 to 4, is 39; the next window, Index 2 to 5, gives 44, and so on. Is there an efficient way to do this? Thanks.
Use Series.rolling with mean:
df['Average'] = df['Numbers'].rolling(4).mean()
print(df)
Index Numbers Average
0 1 12 NaN
1 2 19 NaN
2 3 47 NaN
3 4 78 39.0
4 5 32 44.0
5 6 63 55.0
6 7 89 65.5
Possible functions implemented for rolling windows (a short example follows this list):
Rolling.count
Rolling.sum
Rolling.mean
Rolling.median
Rolling.var
Rolling.std
Rolling.min
Rolling.max
Rolling.corr
Rolling.cov
Rolling.skew
Rolling.kurt
Rolling.apply
Rolling.aggregate
Rolling.quantile
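A small illustration (the extra column names here are just made up for the example): a rolling median works exactly like the rolling mean above, and Rolling.apply accepts an arbitrary function of each window.

import pandas as pd

df = pd.DataFrame({'Numbers': [12, 19, 47, 78, 32, 63, 89]})

# a rolling median works exactly like the rolling mean above
df['RollMedian'] = df['Numbers'].rolling(4).median()
# Rolling.apply accepts an arbitrary function of each window
df['RollRange'] = df['Numbers'].rolling(4).apply(lambda w: w.max() - w.min())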
I would like to create a table of relative start dates using the output of a Pandas pivot table. The columns of the pivot table are months, the rows are accounts, and the cells are a running total of actions. For example:
Date1 Date2 Date3 Date4
1 1 2 3
N/A 1 2 2
The first row's first instance is Date1.
The second row's first instance is Date2.
The new table would be formatted such that the columns are now the months relative to the first action and would look like:
FirstMonth SecondMonth ThirdMonth
1 1 2
1 2 2
Creating the initial pivot table is straightforward in pandas; I'm curious whether there are any suggestions for how to develop the table of relative starting points. Thank you!
First, make sure your dataframe columns are actual datetime values. Then you can run the following to calculate the sum of actions for each date and then group those values by month and calculate the corresponding monthly sum:
>>> df
2019-01-01 2019-01-02 2019-02-01
Row
0 4 22 40
1 22 67 86
2 72 27 25
3 0 26 60
4 44 62 32
5 73 86 81
6 81 17 58
7 88 29 21
>>> df.sum().groupby(df.sum().index.month).sum()
1 720
2 403
And if you want it to reflect what you had above:
>>> import datetime
>>> out = df.sum().groupby(df.sum().index.month).sum().to_frame().T
>>> out.columns = [datetime.datetime.strptime(str(x), '%m').strftime('%B') for x in out.columns]
>>> out
January February
0 720 403
And if I misunderstood you, and you want it broken out by record / row:
>>> df.T.groupby(df.T.index.month).sum().T
1 2
Row
0 26 40
1 89 86
2 99 25
3 26 60
4 106 32
5 159 81
6 98 58
7 117 21
Rename the columns as above.
The trick is to use .apply() combined with dropna():
df.T.apply(lambda x: pd.Series(x.dropna().values)).T
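For example (a sketch; the pivot below is a hypothetical reconstruction of the table in the question):

import numpy as np
import pandas as pd

pivot = pd.DataFrame({'Date1': [1, np.nan],
                      'Date2': [1, 1],
                      'Date3': [2, 2],
                      'Date4': [3, 2]})

# transpose so each account is a column, drop the leading NaNs per account,
# re-number from 0 so every series starts at its first action, transpose back
out = pivot.T.apply(lambda x: pd.Series(x.dropna().values)).T
out.columns = [f'Month{i + 1}' for i in range(out.shape[1])]  # hypothetical names

Accounts that started later simply end with trailing NaNs instead of leading ones.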
I have a grouped dataframe
id num week
101 23 7 3
8 1
9 2
102 34 8 4
9 1
10 2
...
And I need to create new columns to get a DataFrame like this:
id num 7 8 9 10
101 23 3 1 2 0
102 34 0 4 1 2
...
As you can see, the values of the week column have been turned into separate columns.
I may also have the input DataFrame ungrouped (or after reset_index), like this:
id num week
101 23 7 3
101 23 8 1
101 23 9 2
102 34 8 4
102 34 9 1
102 34 10 2
...
but I don't know which form would be easier to start from.
Notice that id and num are both keys.
Use unstack() together with fillna(0) so you don't end up with NaNs.
Let's load the data:
id num week val
101 23 7 3
101 23 8 1
101 23 9 2
102 34 8 4
102 34 9 1
102 34 10 2
s = pd.read_clipboard(index_col=[0,1,2], squeeze=True)
Notice I have set the index to be id, num and week; if your frame isn't indexed this way yet, use set_index.
Now we can unstack, i.e. move week from the index (rows) to the columns. By default unstack moves the last index level, which is week here, but you can specify it explicitly with level=-1 or level='week':
s.unstack().fillna(0)
Note that, as pointed out by @piRsquared, you can do s.unstack(fill_value=0) to do it in one go.
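Without the clipboard, a self-contained version might look like this (a sketch; the value column name 'val' is the one assumed above):

import pandas as pd

df = pd.DataFrame({'id':   [101, 101, 101, 102, 102, 102],
                   'num':  [23, 23, 23, 34, 34, 34],
                   'week': [7, 8, 9, 8, 9, 10],
                   'val':  [3, 1, 2, 4, 1, 2]})

# set the keys as the index, then pivot week out of the index into columns
out = df.set_index(['id', 'num', 'week'])['val'].unstack(fill_value=0)

out.reset_index() then gives exactly the layout in the question, with id, num and one column per week.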