Pandas Rolling mean with GroupBy and Sort - python

I have a DataFrame that looks like:
f_period f_year f_month subject month year value
20140102 2014 1 a 1 2018 10
20140109 2014 1 a 1 2018 12
20140116 2014 1 a 1 2018 8
20140202 2014 2 a 1 2018 20
20140209 2014 2 a 1 2018 15
20140102 2014 1 b 1 2018 10
20140109 2014 1 b 1 2018 12
20140116 2014 1 b 1 2018 8
20140202 2014 2 b 1 2018 20
20140209 2014 2 b 1 2018 15
The f_period is the date when a forecast for a SKU (column subject) was made. The month and year columns give the period for which the forecast was made. For example, the first row says that on 01/02/2014, the model was forecasting 10 units of product a for month 1 of year 2018.
I am trying to create a rolling average prediction by subject, by month for 2 f_months. The DataFrame should look like:
f_period f_year f_month subject month year value mnthly_avg rolling_2_avg
20140102 2014 1 a 1 2018 10 10 13
20140109 2014 1 a 1 2018 12 10 13
20140116 2014 1 a 1 2018 8 10 13
20140202 2014 2 a 1 2018 20 17.5 null
20140209 2014 2 a 1 2018 15 17.5 null
20140102 2014 1 b 1 2018 10 10 13
20140109 2014 1 b 1 2018 12 10 13
20140116 2014 1 b 1 2018 8 10 13
20140202 2014 2 b 1 2018 20 17.5 null
20140209 2014 2 b 1 2018 15 17.5 null
Things I tried:
I was able to get mnthly_avg with:
data_df['monthly_avg'] = data_df.groupby(['f_month', 'f_year', 'year', 'month', 'period', 'subject']).\
value.transform('mean')
I tried getting the rolling_2_avg :
rolling_monthly_df = data_df[['f_year', 'f_month', 'subject', 'month', 'year', 'value', 'f_period']].\
groupby(['f_year', 'f_month', 'subject', 'month', 'year']).value.mean().reset_index()
rolling_monthly_df['rolling_2_avg'] = rolling_monthly_df.groupby(['subject', 'month']).\
value.rolling(2).mean().reset_index(drop=True)
This gave me unexpected output, and I don't understand how it calculated the values for rolling_2_avg.
How do I group by subject and month, sort by f_month, and then take the average of the monthly averages for the next two f_months?

Unless I'm misunderstanding, it seems simpler than what you've done. What about this?
grp = pd.DataFrame(df.groupby(['subject', 'month', 'f_month'])['value'].sum())
grp['rolling'] = grp.rolling(window=2).mean()
grp
Output:
                       value  rolling
subject month f_month
a       1     1           30      NaN
              2           35     32.5
b       1     1           30     32.5
              2           35     32.5

I would be a bit careful with Josh's solution. If you want to group by subject, you can't use rolling like that, because the window rolls across subjects: it will eventually take the mean of a month from subject a and a month from subject b, rather than giving the null you would probably prefer.
An alternative is to split the DataFrame and run the rolling mean on each subject individually (I noticed that you want the nulls at the end of each group, so you may want to sort the DataFrame before and after):
for unique_subject in df['subject'].unique():
    # .copy() avoids a SettingWithCopyWarning when the new column is added
    df_subject = df[df['subject'] == unique_subject].copy()
    df_subject['rolling'] = df_subject['value'].rolling(window=2).mean()
    print(df_subject)  # just to print; you may want to concatenate these
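The per-subject idea can also be written without an explicit loop. The following is only a sketch built on the sample data from the question: it computes the monthly average per subject and forecast month, then a two-month rolling mean that restarts for every subject. The `shift(-1)` is an assumption about what the question wants (a forward-looking window, so the last forecast month is null); the names `keys`, `monthly` and `result` are made up for illustration. Note that the mean of 10 and 17.5 is 13.75, not the 13 shown in the question's table.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "f_period": [20140102, 20140109, 20140116, 20140202, 20140209] * 2,
    "f_year": [2014] * 10,
    "f_month": [1, 1, 1, 2, 2] * 2,
    "subject": ["a"] * 5 + ["b"] * 5,
    "month": [1] * 10,
    "year": [2018] * 10,
    "value": [10, 12, 8, 20, 15] * 2,
})

keys = ["subject", "year", "month"]      # what the forecast is about
f_keys = keys + ["f_year", "f_month"]    # ...and when it was made

# One row per subject / target period / forecast month with the monthly mean.
monthly = (
    df.groupby(f_keys, as_index=False)["value"]
      .mean()
      .rename(columns={"value": "monthly_avg"})
      .sort_values(f_keys)
)

# Rolling mean over 2 forecast months, computed separately for each group.
# shift(-1) moves the result onto the earlier month (assumed intent).
monthly["rolling_2_avg"] = (
    monthly.groupby(keys)["monthly_avg"]
           .transform(lambda s: s.rolling(window=2).mean().shift(-1))
)

# Attach both columns back to the original weekly rows.
result = df.merge(monthly, on=f_keys, how="left")
print(result)
```

Because the rolling window is applied inside `groupby(...).transform`, the last forecast month of subject a is never averaged with the first month of subject b.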

Related

How do I create a new column that references other row's data for its values?

I have the following data frame:
Month Day Year Open High Low Close Week
0 1 1 2003 46.593 46.656 46.405 46.468 1
1 1 2 2003 46.538 46.66 46.47 46.673 1
2 1 3 2003 46.717 46.781 46.53 46.750 1
3 1 4 2003 46.815 46.843 46.68 46.750 1
4 1 5 2003 46.935 47.000 46.56 46.593 1
... ... ... ... ... ... ... ... ...
7257 10 26 2022 381.619 387.5799 381.350 382.019 43
7258 10 27 2022 383.07 385.00 379.329 379.98 43
7259 10 28 2022 379.869 389.519 379.67 389.019 43
7260 10 31 2022 386.44 388.399 385.26 386.209 44
7261 11 1 2022 390.14 390.39 383.29 384.519 44
I want to create a new column titled 'week high' which will reference each week every year and pull in the high. So for Week 1, Year 2003, it will take the Highest High from rows 0 to 4 but for Week 43, Year 2022, it will take the Highest High from rows 7257 to 7259.
Is it possible to reference the columns Week and Year to calculate that value? Thanks!
Assuming pandas, create a weekly period and use it as grouper for transform('max'):
group = pd.to_datetime(df[['Year', 'Month', 'Day']]).dt.to_period('W')
# or, if you already have "Year" and "Week" columns
# (group on both so week 1 of different years is not mixed)
# group = ["Year", "Week"]
df['week_high'] = df.groupby(group)['High'].transform('max')
Output:
Month Day Year Open High Low Close Week week_high
0 1 1 2003 46.593 46.6560 46.405 46.468 1.0 47.000
1 1 2 2003 46.538 46.6600 46.470 46.673 1.0 47.000
2 1 3 2003 46.717 46.7810 46.530 46.750 1.0 47.000
3 1 4 2003 46.815 46.8430 46.680 46.750 1.0 47.000
4 1 5 2003 46.935 47.0000 46.560 46.593 1.0 47.000
7257 10 26 2022 381.619 387.5799 381.350 382.019 43.0 389.519
7258 10 27 2022 383.070 385.0000 379.329 379.980 43.0 389.519
7259 10 28 2022 379.869 389.5190 379.670 389.019 43.0 389.519
7260 10 31 2022 386.440 388.3990 385.260 386.209 44.0 390.390
7261 11 1 2022 390.140 390.3900 383.290 384.519 44 390.390
I am assuming you are using pandas. Other libraries will work similarly.
Create a new DataFrame aggregated per year and week using groupby, then join it back to your original DataFrame:
df_grouped = df[["Year", "Week", "High"]].groupby(["Year", "Week"]).max().rename(columns={"High": "Highest High"})
df_result = df.join(df_grouped, on=["Year", "Week"])
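As a small self-contained check of the aggregate-and-join idea, the sketch below groups on both Year and Week and compares the result with `transform("max")`. The highs are taken from a few rows of the question; the row for Year 2022, Week 1 is hypothetical and is there only to show why Year has to be part of the key.

```python
import pandas as pd

# A few highs from the question, plus one hypothetical row
# (Year 2022, Week 1, High 400.0) to show why Year belongs in the key.
df = pd.DataFrame({
    "Year": [2003, 2003, 2003, 2022, 2022, 2022, 2022],
    "Week": [1, 1, 1, 1, 43, 43, 44],
    "High": [46.656, 46.843, 47.000, 400.0, 387.5799, 389.519, 390.39],
})

# Approach 1: aggregate once per (Year, Week), then join the result back.
weekly_max = df.groupby(["Year", "Week"])["High"].max().rename("Highest High")
joined = df.join(weekly_max, on=["Year", "Week"])

# Approach 2: transform broadcasts the group maximum to every row directly.
df["week_high"] = df.groupby(["Year", "Week"])["High"].transform("max")

print(joined)
print(df)
```

Both approaches should give the same column. Grouping on Week alone would instead give 400.0 for the 2003 rows in this sample.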

Use pandas to get previous year sales in the same row

I have a table of sales for different companies.
company_name sales year
A 200 2019
A 100 2018
A 30 2017
B 15 2019
B 30 2018
B 45 2017
Now I want to add the previous year's sales to the same row, like this:
company_name sales year previous_sales
A 200 2019 100
A 100 2018 30
A 30 2017 NaN
B 15 2019 30
B 30 2018 45
B 45 2017 NaN
I tried the following code, but it did not give the right result:
df["previous_sales"] = df.groupby(['company_name', 'year'])['sales'].shift()
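One likely reason the attempt fails: grouping by both company_name and year makes every group a single row, so `shift()` has nothing to shift and returns NaN everywhere. A minimal sketch of an alternative, assuming each company has one row per year with no gaps (otherwise `shift` returns the previous available row, not the previous calendar year), is to group by company only and shift within a year-sorted frame. The name `ordered` is made up for illustration.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "company_name": ["A", "A", "A", "B", "B", "B"],
    "sales": [200, 100, 30, 15, 30, 45],
    "year": [2019, 2018, 2017, 2019, 2018, 2017],
})

# Sort oldest-first within each company so shift(1) looks one row (year) back.
ordered = df.sort_values(["company_name", "year"])

# Assignment aligns on the index, so df keeps its original row order.
df["previous_sales"] = ordered.groupby("company_name")["sales"].shift(1)

print(df)
```

The earliest year of each company has no predecessor, so it stays NaN, which matches the expected table in the question.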

How to calculate cumulative sum in python using pandas of all the columns except the first one that contain names?

Here's the data in csv format:
Name 2012 2013 2014 2015 2016 2017 2018 2019 2020
Jack 1 15 25 3 5 11 5 8 3
Jill 5 10 32 5 5 14 6 8 7
I don't want the Name column to be included, as it gives an error.
I tried
df.cumsum()
Try with set_index and reset_index to keep the name column:
df.set_index('Name').cumsum().reset_index()
Output:
Name 2012 2013 2014 2015 2016 2017 2018 2019 2020
0 Jack 1 15 25 3 5 11 5 8 3
1 Jill 6 25 57 8 10 25 11 16 10
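The output above accumulates down the rows, so Jill's row is Jack plus Jill for each year. The question does not say which direction is intended; if the goal is a running total across the years for each person, `axis=1` is the variation to try. A short sketch on the question's data (the names `down_rows` and `across_years` are made up):

```python
import pandas as pd

# Sample data copied from the question; year columns are kept as strings.
years = ["2012", "2013", "2014", "2015", "2016", "2017", "2018", "2019", "2020"]
df = pd.DataFrame(
    [["Jack", 1, 15, 25, 3, 5, 11, 5, 8, 3],
     ["Jill", 5, 10, 32, 5, 5, 14, 6, 8, 7]],
    columns=["Name"] + years,
)

# Moving Name into the index keeps it out of the arithmetic.
indexed = df.set_index("Name")

# Default axis=0: running total down the rows (Jack, then Jack + Jill).
down_rows = indexed.cumsum().reset_index()

# axis=1: running total across the year columns, separately for each person.
across_years = indexed.cumsum(axis=1).reset_index()

print(down_rows)
print(across_years)
```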

Group (sum) by Month, Year and another Variable in Python

I'm quite new to programming, and I'm using Python for data manipulation and analysis.
I have a dataframe that looks like:
Brand Date Unit
A 1/1/19 10
B 3/1/19 11
A 11/1/19 15
B 11/1/19 5
A 1/1/20 10
A 9/2/19 18
B 12/2/19 11
B 19/2/19 8
B 1/1/20 5
And I would like to group by month, year and Brand. If it helps, I also have separate columns for Month and Year. The expected result should look like this:
Brand Date Unit
A Jan 2019 25
B Jan 2019 16
A Feb 2019 18
B Feb 2019 19
A Jan 2020 8
B Feb 2020 5
I tried adapting an answer from someone else's question:
per = df.Date.dt.to_period("M")
g = df.groupby(per,'Brand')
g.sum()
but I get prompted:
ValueError: No axis named Brand for object type <class 'pandas.core.frame.DataFrame'>
and I don't have any idea how to solve this.
I used to do this with dictionaries by selecting each month/year individually, group by sum and then create the dictionary, but it seems kind of brute force, really rough and it won't help if the df gets updated with new data.
What's more, maybe I'm taking the wrong approach. In the end I'd like to have a df that looks like:
Brand Jan 19 Feb 19 Jan 20
A 25 18 8
B 16 19 5
Use pandas.to_datetime and pandas.DataFrame.pivot_table:
df["Date"] = pd.to_datetime(df["Date"], dayfirst=True).dt.strftime("%b %Y")
new_df = df.pivot_table(index="Brand", columns="Date", aggfunc="sum")
print(new_df)
Output:
Unit
Date Feb 2019 Jan 2019 Jan 2020
Brand
A 18 25 10
B 19 16 5
You were close: DataFrame.groupby wants a list of groupers, not bare arguments.
Here's how I did it:
import pandas
from io import StringIO
csv = StringIO("""\
Brand Date Unit
A 1/1/19 10
B 3/1/19 11
A 11/1/19 15
B 11/1/19 5
A 1/1/20 10
A 9/2/19 18
B 12/2/19 11
B 19/2/19 8
B 1/1/20 5
""")
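(
    # raw string so that \s is not treated as a string escape
    pandas.read_csv(csv, parse_dates=['Date'], sep=r'\s+', dayfirst=True)
    # month-end bins; pandas 2.2+ spells this frequency 'ME'
    .groupby(['Brand', pandas.Grouper(key='Date', freq='1M')])
    .sum()
    .reset_index()
)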
And that gives me:
Brand Date Unit
0 A 2019-01-31 25
1 A 2019-02-28 18
2 A 2020-01-31 10
3 B 2019-01-31 16
4 B 2019-02-28 19
5 B 2020-01-31 5
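For comparison, the original attempt can be kept almost unchanged once the two groupers are passed as a list. The sketch below assumes the dates are day-first with two-digit years, as in the sample; `monthly` and `wide` are illustrative names, and `unstack` is used here to get the wide Brand-by-month layout asked for at the end of the question.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Brand": ["A", "B", "A", "B", "A", "A", "B", "B", "B"],
    "Date": ["1/1/19", "3/1/19", "11/1/19", "11/1/19", "1/1/20",
             "9/2/19", "12/2/19", "19/2/19", "1/1/20"],
    "Unit": [10, 11, 15, 5, 10, 18, 11, 8, 5],
})

# Assumption: day/month/two-digit-year, as the sample suggests.
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%y")

per = df["Date"].dt.to_period("M")

# The groupers go in ONE list; groupby(per, 'Brand') passes 'Brand' as axis.
monthly = df.groupby([per, "Brand"])["Unit"].sum()

# Move the monthly periods into columns: one row per Brand.
wide = monthly.unstack("Date")

print(monthly)
print(wide)
```

Using periods rather than formatted strings keeps the month columns in chronological order.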

calculating mean and sum in pivot_table in pandas sorted by two separate desired col values

I have a data set covering 2015-2018, with the month and day in the second and third columns, like below:
Year Month Day rain temp humidity snow
2015 1 1 0 20 60 0
2015 1 2 2 18 58 0
2015 1 3 0 20 62 2
2015 1 4 5 15 62 0
2015 1 5 2 18 61 1
2015 1 6 0 19 60 2
2015 1 7 3 20 59 0
2015 1 8 2 17 65 0
2015 1 9 1 17 61 0
I wanted to use pivot_table to calculate something like the mean temperature for year 2016 and months 1, 2 and 3.
Could anyone help me with this?
You can do this with pd.cut, then groupby. The default right-closed bins put months 1-3 in the first bin, 4-6 in the second, and so on:
df.temp.groupby([df.Year, pd.cut(df.Month, [0, 3, 6, 9, 12], labels=['Winter', 'Spring', 'Summer', 'Autumn'])]).mean()
Out[93]:
Year Month
2015 Winter 18.222222
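Since the question asks about pivot_table specifically, here is a sketch of that route: filter the rows first, then give pivot_table a different aggregation per column. The sample only contains January 2015, so the filter uses 2015 rather than the 2016 mentioned in the question; choosing mean for temp and sum for rain is an assumption based on the title, and `subset` and `table` are illustrative names.

```python
import pandas as pd

# Sample rows copied from the question (January 2015 only).
df = pd.DataFrame({
    "Year": [2015] * 9,
    "Month": [1] * 9,
    "Day": list(range(1, 10)),
    "rain": [0, 2, 0, 5, 2, 0, 3, 2, 1],
    "temp": [20, 18, 20, 15, 18, 19, 20, 17, 17],
    "humidity": [60, 58, 62, 62, 61, 60, 59, 65, 61],
    "snow": [0, 0, 2, 0, 1, 2, 0, 0, 0],
})

# Keep only the year and months of interest before pivoting.
subset = df[(df["Year"] == 2015) & (df["Month"].isin([1, 2, 3]))]

# A dict for aggfunc applies a different aggregation to each value column.
table = subset.pivot_table(
    index="Year",
    values=["temp", "rain"],
    aggfunc={"temp": "mean", "rain": "sum"},
)

print(table)
```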
