How to calculate the average count per hour? - python
I have the following DataFrame df, and I want to calculate the average number of entries per hour over the year, grouped by runway:
year month day hour runway
2017 12 30 10 32L
2017 12 30 11 32L
2017 12 30 11 32L
2017 12 30 11 32L
2017 12 30 11 30R
2018 12 30 10 32L
2018 12 30 10 32L
2018 12 30 11 32L
2018 12 30 11 32L
The expected result is this one:
year runway avg. count per hour
2017 32L 2
2017 30R 0.5
2018 32L 2
2018 30R 0
I tried this, but it does not calculate the average count per hour:
result = df.groupby(['year','runway']).count()
Here's one way of achieving it:
# Count the unique hours per year
s = df.groupby(['year'])['hour'].nunique()
# Count the entries per year and runway
n = df.groupby(['year','runway']).size().reset_index()
# Divide the entry counts by the number of hours in that year
n['avg'] = n[0]/n['year'].map(s)
print(n)
year runway 0 avg
0 2017 30R 1 0.5
1 2017 32L 4 2.0
2 2018 32L 4 2.0
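For reference, a self-contained sketch of the same approach (the sample frame below is retyped from the question, so treat it as illustrative):

import pandas as pd

df = pd.DataFrame({
    'year':   [2017, 2017, 2017, 2017, 2017, 2018, 2018, 2018, 2018],
    'month':  [12] * 9,
    'day':    [30] * 9,
    'hour':   [10, 11, 11, 11, 11, 10, 10, 11, 11],
    'runway': ['32L', '32L', '32L', '32L', '30R', '32L', '32L', '32L', '32L'],
})

# Unique hours observed per year (2 per year in this sample)
hours_per_year = df.groupby('year')['hour'].nunique()

# Entries per year/runway, divided by that year's hour count
avg = (df.groupby(['year', 'runway']).size()
         .div(hours_per_year, level='year')
         .reset_index(name='avg'))
print(avg)

Note that, like the approach above, this only yields rows for year/runway pairs that actually occur in the data; getting the 2018 30R 0 row from the expected output would additionally require reindexing against all year/runway combinations.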
Related
Trying to get a forecast for a 7-day period using moving average logic
I am trying to get the moving averages for a window period of 7 days, and also to forecast the moving averages for the next 7-day period, i.e. a 7-day forecast using a 7-day moving average.

Dates        actual prices   Forecast
1 October    5400
2 October    5375
3 October    5350
4 October    5325
5 October    5300
6 October    5275
7 October    5250
8 October    5250
9 October    5200
10 October   5175
11 October   5150
12 October   5125
13 October   5100            5325
14 October                   5303.571429
15 October                   5278.571429
16 October                   5253.571429
17 October                   5228.571429
18 October                   5203.571429
19 October                   5178.571429

I am not able to forecast using the .rolling(7).mean() function or the .expanding() function.

df['7Day_Rolling_avg'] = df['actual price'].rolling(window=7).mean()

The above code gives me the rolling average, but I am not able to forecast the values for the 7-day window period.
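One reading of the sample numbers: the Forecast column is the 7-day rolling mean pushed six rows forward, so the mean of 1-7 October (5325) becomes the forecast for 13 October, and so on. A sketch under that assumption, with the year and column names assumed:

import pandas as pd

prices = [5400, 5375, 5350, 5325, 5300, 5275, 5250,
          5250, 5200, 5175, 5150, 5125, 5100]
idx = pd.date_range('2021-10-01', periods=len(prices), freq='D')  # year is assumed
df = pd.DataFrame({'actual price': prices}, index=idx)

# Extend the index by six days so the forecast runs through 19 October
# (13 October already exists and carries both an actual and a forecast)
future = pd.date_range(idx[-1] + pd.Timedelta(days=1), periods=6, freq='D')
df = df.reindex(idx.append(future))

# 7-day rolling mean of the actuals, shifted 6 rows forward: the window
# ending on 7 October lands on 13 October, and so on
df['Forecast'] = df['actual price'].rolling(window=7).mean().shift(6)
print(df)

If instead each forecast is supposed to feed back into the next window (a truly recursive moving-average forecast), the numbers would differ from the sample above, and a small loop appending each new mean to the series would be needed.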
Use pandas to get previous year sales in the same row
I have a table of different companies' sales:

company_name   sales   year
A              200     2019
A              100     2018
A              30      2017
B              15      2019
B              30      2018
B              45      2017

Now I want to add the previous year's sales in the same row, like this:

company_name   sales   year   previous_sales
A              200     2019   100
A              100     2018   30
A              30      2017   NaN
B              15      2019   30
B              30      2018   45
B              45      2017   NaN

I tried code like this, but I failed to get the right result:

df["previous_sales"] = df.groupby(['company_name', 'year'])['sales'].shift()
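The likely problem with that attempt is that grouping by both company_name and year puts every row in its own one-row group, so shift() has nothing to shift. A sketch of one plausible fix, assuming the rows are sorted by year descending within each company as in the sample:

import pandas as pd

df = pd.DataFrame({
    'company_name': ['A', 'A', 'A', 'B', 'B', 'B'],
    'sales':        [200, 100, 30, 15, 30, 45],
    'year':         [2019, 2018, 2017, 2019, 2018, 2017],
})

# Group by company only; with years descending, the previous year's
# sales sit one row below, hence shift(-1)
df['previous_sales'] = df.groupby('company_name')['sales'].shift(-1)
print(df)

If the ordering is not guaranteed, sorting first with df.sort_values('year').groupby('company_name')['sales'].shift() and then restoring the original order is the safer variant.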
Pandas Rolling mean with GroupBy and Sort
I have a DataFrame that looks like:

f_period   f_year  f_month  subject  month  year  value
20140102   2014    1        a        1      2018  10
20140109   2014    1        a        1      2018  12
20140116   2014    1        a        1      2018  8
20140202   2014    2        a        1      2018  20
20140209   2014    2        a        1      2018  15
20140102   2014    1        b        1      2018  10
20140109   2014    1        b        1      2018  12
20140116   2014    1        b        1      2018  8
20140202   2014    2        b        1      2018  20
20140209   2014    2        b        1      2018  15

f_period is the date when a forecast for a SKU (column subject) was made. The month and year columns give the period the forecast was made for. For example, the first row says that on 01/02/2018 the model was forecasting 10 units of product a for month 1 of year 2018.

I am trying to create a rolling average prediction by subject, by month, over 2 f_months. The DataFrame should look like:

f_period   f_year  f_month  subject  month  year  value  mnthly_avg  rolling_2_avg
20140102   2014    1        a        1      2018  10     10          13
20140109   2014    1        a        1      2018  12     10          13
20140116   2014    1        a        1      2018  8      10          13
20140202   2014    2        a        1      2018  20     17.5        null
20140209   2014    2        a        1      2018  15     17.5        null
20140102   2014    1        b        1      2018  10     10          13
20140109   2014    1        b        1      2018  12     10          13
20140116   2014    1        b        1      2018  8      10          13
20140202   2014    2        b        1      2018  20     17.5        null
20140209   2014    2        b        1      2018  15     17.5        null

Things I tried: I was able to get mnthly_avg by:

data_df['monthly_avg'] = data_df.groupby(['f_month', 'f_year', 'year', 'month', 'period', 'subject']).\
    value.transform('mean')

I tried getting the rolling_2_avg:

rolling_monthly_df = data_df[['f_year', 'f_month', 'subject', 'month', 'year', 'value', 'f_period']].\
    groupby(['f_year', 'f_month', 'subject', 'month', 'year']).value.mean().reset_index()
rolling_monthly_df['rolling_2_avg'] = rolling_monthly_df.groupby(['subject', 'month']).\
    value.rolling(2).mean().reset_index(drop=True)

This gave me an unexpected output; I don't understand how it calculated the values for rolling_2_avg. How do I group by subject and month, then sort by f_month, and then take the average over the two-month window?
Unless I'm misunderstanding, it seems simpler than what you've done. What about this?

grp = pd.DataFrame(df.groupby(['subject', 'month', 'f_month'])['value'].sum())
grp['rolling'] = grp.rolling(window=2).mean()
grp

Output:

                       value  rolling
subject month f_month
a       1     1           30      NaN
              2           35     32.5
b       1     1           30     32.5
              2           35     32.5
I would be a bit careful with Josh's solution. If you want to group by subject, you can't use the rolling function like that, as it will roll across subjects (i.e. it will eventually take the mean of a month from subject a and one from subject b, rather than giving the null you might prefer). An alternative is to split the dataframe and run the rolling individually (I noticed that you want the nulls by the end of the dataframe, so you may want to sort the dataframe before and after):

for unique_subject in df['subject'].unique():
    # .copy() avoids SettingWithCopyWarning when assigning below
    df_subject = df[df['subject'] == unique_subject].copy()
    df_subject['rolling'] = df_subject['value'].rolling(window=2).mean()
    print(df_subject)  # just to print; you may want to concatenate these
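A loop-free sketch of the same idea, using a per-group transform so the window restarts inside each subject/month group and nothing bleeds across subjects (data reconstructed from the question's sample):

import pandas as pd

df = pd.DataFrame({
    'subject': ['a'] * 5 + ['b'] * 5,
    'month':   [1] * 10,
    'f_month': [1, 1, 1, 2, 2] * 2,
    'value':   [10, 12, 8, 20, 15] * 2,
})

grp = df.groupby(['subject', 'month', 'f_month'], as_index=False)['value'].sum()
# rolling(2) restarts inside each subject/month group, so the first
# f_month of every group gets NaN rather than a cross-subject mean
grp['rolling_2_avg'] = grp.groupby(['subject', 'month'])['value'] \
                          .transform(lambda s: s.rolling(2).mean())
print(grp)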
Grouping data series by day intervals with Pandas
I have to perform some data analysis on a seasonal basis. I have circa one and a half years' worth of hourly measurements, from the end of 2015 to the second half of 2017, and I want to sort this data into seasons. Here's an example of the data I am working with:

Date,Year,Month,Day,Day week,Hour,Holiday,Week Day,Impulse,Power (kW),Temperature (C)
04/12/2015,2015,12,4,6,18,0,6,2968,1781,16.2
04/12/2015,2015,12,4,6,19,0,6,2437,1462,16.2
19/04/2016,2016,4,19,3,3,0,3,1348,809,14.4
19/04/2016,2016,4,19,3,4,0,3,1353,812,14.1
11/06/2016,2016,6,11,7,19,0,7,1395,837,18.8
11/06/2016,2016,6,11,7,20,0,7,1370,822,17.4
11/06/2016,2016,6,11,7,21,0,7,1364,818,17
11/06/2016,2016,6,11,7,22,0,7,1433,860,17.5
04/12/2016,2016,12,4,1,17,0,1,1425,855,14.6
04/12/2016,2016,12,4,1,18,0,1,1466,880,14.4
07/03/2017,2017,3,7,3,14,0,3,3668,2201,14.2
07/03/2017,2017,3,7,3,15,0,3,3666,2200,14
24/04/2017,2017,4,24,2,5,0,2,1347,808,11.4
24/04/2017,2017,4,24,2,6,0,2,1816,1090,11.5
24/04/2017,2017,4,24,2,7,0,2,2918,1751,12.4
15/06/2017,2017,6,15,5,13,1,1,2590,1554,22.5
15/06/2017,2017,6,15,5,14,1,1,2629,1577,22.5
15/06/2017,2017,6,15,5,15,1,1,2656,1594,22.1
15/11/2017,2017,11,15,4,13,0,4,3765,2259,15.6
15/11/2017,2017,11,15,4,14,0,4,3873,2324,15.9
15/11/2017,2017,11,15,4,15,0,4,3905,2343,15.8
15/11/2017,2017,11,15,4,16,0,4,3861,2317,15.3

As you can see, I have data from three different years. My idea was to convert the first column with the pd.to_datetime() command, then group the rows by day/month in dd/mm intervals, regardless of the year (if winter goes from 21/12 to 21/03, create a new dataframe with all of the rows whose date falls in this interval, whatever the year). But I couldn't manage to do it while neglecting the year, which makes things more complicated.

EDIT: A desired output would be:

df_spring

Date,Year,Month,Day,Day week,Hour,Holiday,Week Day,Impulse,Power (kW),Temperature (C)
19/04/2016,2016,4,19,3,3,0,3,1348,809,14.4
19/04/2016,2016,4,19,3,4,0,3,1353,812,14.1
07/03/2017,2017,3,7,3,14,0,3,3668,2201,14.2
07/03/2017,2017,3,7,3,15,0,3,3666,2200,14
24/04/2017,2017,4,24,2,5,0,2,1347,808,11.4
24/04/2017,2017,4,24,2,6,0,2,1816,1090,11.5
24/04/2017,2017,4,24,2,7,0,2,2918,1751,12.4

df_autumn

Date,Year,Month,Day,Day week,Hour,Holiday,Week Day,Impulse,Power (kW),Temperature (C)
04/12/2015,2015,12,4,6,18,0,6,2968,1781,16.2
04/12/2015,2015,12,4,6,19,0,6,2437,1462,16.2
04/12/2016,2016,12,4,1,17,0,1,1425,855,14.6
04/12/2016,2016,12,4,1,18,0,1,1466,880,14.4
15/11/2017,2017,11,15,4,13,0,4,3765,2259,15.6
15/11/2017,2017,11,15,4,14,0,4,3873,2324,15.9
15/11/2017,2017,11,15,4,15,0,4,3905,2343,15.8
15/11/2017,2017,11,15,4,16,0,4,3861,2317,15.3

And so on for the remaining seasons.
Define each season by filtering the relevant rows using the Day and Month columns, as presented here for winter:

df_winter = df.loc[((df['Day'] >= 21) & (df['Month'] == 12)) |
                   (df['Month'] == 1) |
                   (df['Month'] == 2) |
                   ((df['Day'] <= 21) & (df['Month'] == 3))]
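To label all four seasons in one pass, here is a sketch using a Month*100 + Day key, so dd/mm boundaries compare regardless of year (the 21st boundaries follow the question; inclusivity at the edges is a judgment call):

import pandas as pd
import numpy as np

# df stands in for the question's frame; only Month and Day matter here
df = pd.DataFrame({'Month': [12, 4, 6, 12, 3, 4, 6, 11],
                   'Day':   [4, 19, 11, 4, 7, 24, 15, 15]})

md = df['Month'] * 100 + df['Day']          # e.g. 21 December -> 1221
conditions = [
    (md >= 1221) | (md <= 321),             # winter: 21/12 through 21/03
    (md > 321) & (md <= 621),               # spring: 22/03 through 21/06
    (md > 621) & (md <= 921),               # summer: 22/06 through 21/09
]
df['season'] = np.select(conditions, ['winter', 'spring', 'summer'], default='autumn')

df_spring = df[df['season'] == 'spring']    # and likewise for the other seasons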
You can simply filter your dataframe with Month.isin():

# spring
df[df['Month'].isin([3, 4])]

          Date  Year  Month  Day  Day week  Hour  Holiday  Week Day  Impulse  Power (kW)  Temperature (C)
2   19/04/2016  2016      4   19         3     3        0         3     1348         809             14.4
3   19/04/2016  2016      4   19         3     4        0         3     1353         812             14.1
10  07/03/2017  2017      3    7         3    14        0         3     3668        2201             14.2
11  07/03/2017  2017      3    7         3    15        0         3     3666        2200             14.0
12  24/04/2017  2017      4   24         2     5        0         2     1347         808             11.4
13  24/04/2017  2017      4   24         2     6        0         2     1816        1090             11.5
14  24/04/2017  2017      4   24         2     7        0         2     2918        1751             12.4

# autumn
df[df['Month'].isin([11, 12])]

          Date  Year  Month  Day  Day week  Hour  Holiday  Week Day  Impulse  Power (kW)  Temperature (C)
0   04/12/2015  2015     12    4         6    18        0         6     2968        1781             16.2
1   04/12/2015  2015     12    4         6    19        0         6     2437        1462             16.2
8   04/12/2016  2016     12    4         1    17        0         1     1425         855             14.6
9   04/12/2016  2016     12    4         1    18        0         1     1466         880             14.4
18  15/11/2017  2017     11   15         4    13        0         4     3765        2259             15.6
19  15/11/2017  2017     11   15         4    14        0         4     3873        2324             15.9
20  15/11/2017  2017     11   15         4    15        0         4     3905        2343             15.8
21  15/11/2017  2017     11   15         4    16        0         4     3861        2317             15.3
How to get average hourly number of entries?
I have the following DataFrame df, and I want to calculate the average hourly number of entries per day, grouped by runway:

year month day hour runway
2017 12 30 10 32L
2017 12 30 11 32L
2017 12 30 11 32L
2017 12 30 11 32L
2017 12 30 11 30R
2018 12 31 10 32L
2018 12 31 10 32L
2018 12 31 11 32L
2018 12 31 11 32L

The expected result is this one:

hour  avg. count per hour
10    1.5
11    3

If I group by hour and take the size, I get the total count of entries per hour:

df.groupby("hour").size()

But how can I get the average number of entries per hour? I tried something like this, but it fails with an error:

s = df.groupby(["hour"])["month","day"].nunique()
df_arr = asma_df.groupby(["hour"]).size().reset_index()
df_arr[0]/df_arr["hour"].map(s)

UPDATE: The indicated duplicate question is different from mine. I am asking about the average hourly count, not the total hourly count, so it is not helpful.
I think you need to assign the output of the division, which is a Series, to a new column avg:

s = df.groupby(["hour"])["day"].nunique()
df_arr = df.groupby(["hour"]).size().reset_index(name='avg')
df_arr['avg'] /= df_arr["hour"].map(s)
#alternative
#df_arr = df_arr.assign(avg = df_arr['avg'] / df_arr["hour"].map(s))
print (df_arr)

   hour  avg
0    10  1.5
1    11  3.0

Or divide the Series directly and create the DataFrame at the end with reset_index:

g = df.groupby(["hour"])["day"]
df_arr = g.size().div(g.nunique()).reset_index(name='avg')
print (df_arr)

   hour  avg
0    10  1.5
1    11  3.0

And a solution for checking the values behind the mean:

df_arr = df.groupby(["hour"])["day"].agg(['size','nunique'])
df_arr['avg'] = df_arr['size'] / df_arr['nunique']
print (df_arr)

      size  nunique  avg
hour
10       3        2  1.5
11       6        2  3.0