Pandas mapping with multiple conditions - python
I have a dataframe consisting of Year, Month, Temperature. Now, I need to create seasonal means, such as DJF (Dec, Jan, Feb), MAM (Mar, Apr, May), JJA (Jun, Jul, Aug), SON (Sep, Oct, Nov).
But how can I take into account the fact that DJF should combine December of one year with January and February of the following year?
This is the code I have so far:
z = {1: 'DJF', 2: 'DJF', 3: 'MAM', 4: 'MAM', 5: 'MAM', 6: 'JJA', 7: 'JJA', 8: 'JJA', 9: 'SON', 10: 'SON',
11: 'SON', 12: 'DJF'}
df['season'] = df['Mon'].map(z)
The problem with the above code is that when I group by year and season to calculate the means, the values for DJF will be incorrect, since they take Dec, Jan, and Feb of the same year.
df.groupby(['Year','season']).mean()
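One way to keep the simple mapping approach would be to bump December into the following year before grouping, so each DJF mean pools December of year Y-1 with January and February of year Y. A minimal sketch, assuming the month column is named Mon as in the snippet above:
import pandas as pd

# December belongs to the DJF season labelled with the *next* year,
# so add 1 to the grouping year for month 12 only
df['season_year'] = df['Year'] + (df['Mon'] == 12).astype(int)
df['season'] = df['Mon'].map(z)
print(df.groupby(['season_year', 'season'])['Temperature'].mean())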
I think you can create a PeriodIndex with to_datetime and to_period, then shift one month forward and convert to quarters with asfreq. Last, groupby the index and aggregate the mean:
df['Day'] = 1
# build a monthly PeriodIndex from the Year/Month/Day columns
df.index = pd.to_datetime(df[['Year','Month','Day']]).dt.to_period('M')
# shift each month forward by one, so December falls into Q1 of the
# following year, then convert the PeriodIndex to quarters
df = df.shift(1, freq='M').asfreq('Q')
print (df.groupby(level=0)['Temperature'].mean())
Sample:
import pandas as pd

rng = pd.date_range('2017-04-03', periods=20, freq='M')
df = pd.DataFrame({'Date': rng, 'Temperature': range(20)})
df['Year'] = df.Date.dt.year
df['Month'] = df.Date.dt.month
df = df.drop('Date', axis=1)
print (df)
Temperature Year Month
0 0 2017 4
1 1 2017 5
2 2 2017 6
3 3 2017 7
4 4 2017 8
5 5 2017 9
6 6 2017 10
7 7 2017 11
8 8 2017 12
9 9 2018 1
10 10 2018 2
11 11 2018 3
12 12 2018 4
13 13 2018 5
14 14 2018 6
15 15 2018 7
16 16 2018 8
17 17 2018 9
18 18 2018 10
19 19 2018 11
df['Day'] = 1
df.index = pd.to_datetime(df[['Year','Month','Day']]).dt.to_period('M')
df = df.shift(1, freq='M').asfreq('Q')
print (df)
Temperature Year Month Day
2017Q2 0 2017 4 1
2017Q2 1 2017 5 1
2017Q3 2 2017 6 1
2017Q3 3 2017 7 1
2017Q3 4 2017 8 1
2017Q4 5 2017 9 1
2017Q4 6 2017 10 1
2017Q4 7 2017 11 1
2018Q1 8 2017 12 1
2018Q1 9 2018 1 1
2018Q1 10 2018 2 1
2018Q2 11 2018 3 1
2018Q2 12 2018 4 1
2018Q2 13 2018 5 1
2018Q3 14 2018 6 1
2018Q3 15 2018 7 1
2018Q3 16 2018 8 1
2018Q4 17 2018 9 1
2018Q4 18 2018 10 1
2018Q4 19 2018 11 1
print (df.groupby(level=0)['Temperature'].mean())
2017Q2 0.5
2017Q3 3.0
2017Q4 6.0
2018Q1 9.0
2018Q2 12.0
2018Q3 15.0
2018Q4 18.0
Freq: Q-DEC, Name: Temperature, dtype: float64
And last, if you need a season column:
df1 = df.groupby(level=0)['Temperature'].mean().rename_axis('per').reset_index()
z = {1: 'DJF', 2: 'MAM', 3: 'JJA', 4: 'SON'}
df1['season'] = df1['per'].dt.quarter.map(z)
df1['year'] = df1['per'].dt.year
print (df1)
per Temperature season year
0 2017Q2 0.5 MAM 2017
1 2017Q3 3.0 JJA 2017
2 2017Q4 6.0 SON 2017
3 2018Q1 9.0 DJF 2018
4 2018Q2 12.0 MAM 2018
5 2018Q3 15.0 JJA 2018
6 2018Q4 18.0 SON 2018
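As a side note, the shift can be avoided entirely with a quarterly frequency anchored to November: under freq='Q-NOV' the quarters run Dec-Feb, Mar-May, Jun-Aug and Sep-Nov, so December already lands in Q1 of the following year. A minimal sketch of the same aggregation, rebuilt on the sample data above:
import pandas as pd

rng = pd.date_range('2017-04-03', periods=20, freq='M')
df = pd.DataFrame({'Temperature': range(20),
                   'Year': rng.year, 'Month': rng.month, 'Day': 1})

# Q-NOV quarters coincide with the meteorological seasons,
# e.g. Dec 2017 maps straight to 2018Q1 (DJF) without any shift
per = pd.to_datetime(df[['Year', 'Month', 'Day']]).dt.to_period('Q-NOV')
print(df.groupby(per)['Temperature'].mean())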
Related
Convert a Python data frame with different 'year' columns into a continuous time series
Is it possible to convert a dataframe in Pandas like that into a time series, where each year follows the last one?
This is likely what df.unstack() is meant for.
import numpy as np
import pandas as pd

np.random.seed(111)  # reproducibility
df = pd.DataFrame(
    data={
        "2009": np.random.randn(12),
        "2010": np.random.randn(12),
        "2011": np.random.randn(12),
    },
    index=range(1, 13)
)
print(df)
        2009      2010      2011
1  -1.133838 -1.440585  0.570594
2   0.384319  0.773703  0.915420
3   1.496554 -1.027967 -1.669341
4  -0.355382 -0.090986  0.482714
5  -0.787534  0.492003 -0.310473
6  -0.459439  0.424672  2.394690
7  -0.059169  1.283049  1.550931
8  -0.354174  0.315986 -0.646465
9  -0.735523 -0.408082 -0.928937
10 -1.183940 -0.067948 -1.654976
11  0.238894 -0.952427  0.350193
12 -0.589920 -0.110677 -0.141757
df_out = df.unstack().reset_index()
df_out.columns = ["year", "month", "value"]
print(df_out)
    year  month     value
0   2009      1 -1.133838
1   2009      2  0.384319
2   2009      3  1.496554
3   2009      4 -0.355382
4   2009      5 -0.787534
5   2009      6 -0.459439
6   2009      7 -0.059169
7   2009      8 -0.354174
8   2009      9 -0.735523
9   2009     10 -1.183940
10  2009     11  0.238894
11  2009     12 -0.589920
12  2010      1 -1.440585
13  2010      2  0.773703
14  2010      3 -1.027967
15  2010      4 -0.090986
16  2010      5  0.492003
17  2010      6  0.424672
18  2010      7  1.283049
19  2010      8  0.315986
20  2010      9 -0.408082
21  2010     10 -0.067948
22  2010     11 -0.952427
23  2010     12 -0.110677
24  2011      1  0.570594
25  2011      2  0.915420
26  2011      3 -1.669341
27  2011      4  0.482714
28  2011      5 -0.310473
29  2011      6  2.394690
30  2011      7  1.550931
31  2011      8 -0.646465
32  2011      9 -0.928937
33  2011     10 -1.654976
34  2011     11  0.350193
35  2011     12 -0.141757
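If an actual time index is wanted rather than separate year/month columns, the long frame can be finished off with to_datetime. A small sketch building on the df_out above (note the year labels are strings here, and day=1 is an assumed placeholder for the timestamp):
# assemble a real DatetimeIndex from the year/month columns
parts = df_out[['year', 'month']].astype(int).assign(day=1)
ts = df_out.set_index(pd.to_datetime(parts))['value'].sort_index()
print(ts.head())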
How to calculate a cumulative sum in Python using pandas for all columns except the first one, which contains names?
Here's the data in csv format:
Name 2012 2013 2014 2015 2016 2017 2018 2019 2020
Jack 1 15 25 3 5 11 5 8 3
Jill 5 10 32 5 5 14 6 8 7
I don't want the Name column to be included, as it gives an error. I tried df.cumsum().
Try with set_index and reset_index to keep the name column:
df.set_index('Name').cumsum().reset_index()
Output:
   Name  2012  2013  2014  2015  2016  2017  2018  2019  2020
0  Jack     1    15    25     3     5    11     5     8     3
1  Jill     6    25    57     8    10    25    11    16    10
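An alternative that avoids the index round-trip is to run cumsum on the numeric columns only and assign the result back in place; a small sketch, assuming the default top-to-bottom direction is what is wanted:
# cumulative sum down the rows for every numeric column,
# leaving the non-numeric Name column untouched
num_cols = df.select_dtypes('number').columns
df[num_cols] = df[num_cols].cumsum()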
Fill column with the value from the same month in the previous year
How can I use the value from the same month in the previous year to fill values in the following table for 2020:
Category Month Year Value
A        1     2019 15
B        2     2019 20
A        2     2019 90
A        3     2019 50
B        4     2019 40
A        5     2019 20
A        6     2019 15
A        7     2019 17
A        8     2019 18
A        9     2019 12
A        10    2019 11
A        11    2019 19
A        12    2019 15
A        1     2020 18
A        2     2020 53
A        3     2020 80
The final desired result is the following:
Category Month Year Value
A        1     2019 15
B        2     2019 20
A        2     2019 90
A        3     2019 50
B        4     2019 40
A        4     2019 40
A        5     2019 20
A        6     2019 15
A        7     2019 17
A        8     2019 18
A        9     2019 12
A        10    2019 11
A        11    2019 19
A        12    2019 15
A        1     2020 18
A        2     2020 53
A        3     2020 80
B        4     2020 40
A        4     2020 40
A        5     2020 20
A        6     2020 15
A        7     2020 17
A        8     2020 18
A        9     2020 12
A        10    2020 11
A        11    2020 19
A        12    2020 15
I tried using pandas groupby, but I am not sure if that is the right approach.
IIUC, use pivot_table, then ffill within each Category, then stack:
s = (df.pivot_table(index=['Category','Year'], columns='Month', values='Value')
       .groupby(level=0)
       .ffill()
       .stack()
       .reset_index())
   Category  Year  level_2     0
0         A  2019        1  15.0
1         A  2019        2  90.0
2         A  2019        3  50.0
3         A  2019        5  20.0
4         A  2019        6  15.0
5         A  2019        7  17.0
6         A  2019        8  18.0
7         A  2019        9  12.0
8         A  2019       10  11.0
9         A  2019       11  19.0
10        A  2019       12  15.0
11        A  2020        1  18.0
12        A  2020        2  53.0
13        A  2020        3  80.0
14        A  2020        5  20.0
15        A  2020        6  15.0
16        A  2020        7  17.0
17        A  2020        8  18.0
18        A  2020        9  12.0
19        A  2020       10  11.0
20        A  2020       11  19.0
21        A  2020       12  15.0
22        B  2019        2  20.0
23        B  2019        4  40.0
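Note the stacked frame comes back with the generic labels level_2 and 0 for the last two columns, so a final rename is usually wanted; a small follow-up sketch:
# restore meaningful names after reset_index
s.columns = ['Category', 'Year', 'Month', 'Value']
print(s.head())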
You can accomplish this with a combination of loc, concat, and drop_duplicates. The idea is to concatenate the dataframe with a copy of the 2019 data where the year is changed to 2020, and then keep only the first value for each Category, Month, Year.
df2 = df.loc[df['Year'] == 2019, :].copy()  # copy to avoid SettingWithCopyWarning
df2['Year'] = 2020
pd.concat([df, df2]).drop_duplicates(subset=['Category', 'Month', 'Year'], keep='first')
Output
   Category  Month  Year  Value
0         A      1  2019     15
1         B      2  2019     20
2         A      2  2019     90
3         A      3  2019     50
4         B      4  2019     40
5         A      5  2019     20
6         A      6  2019     15
7         A      7  2019     17
8         A      8  2019     18
9         A      9  2019     12
10        A     10  2019     11
11        A     11  2019     19
12        A     12  2019     15
13        A      1  2020     18
14        A      2  2020     53
15        A      3  2020     80
1         B      2  2020     20
4         B      4  2020     40
5         A      5  2020     20
6         A      6  2020     15
7         A      7  2020     17
8         A      8  2020     18
9         A      9  2020     12
10        A     10  2020     11
11        A     11  2020     19
12        A     12  2020     15
How to use groupby and grouper properly for accumulating column 'A' and averaging column 'B', month by month
I have a pandas dataframe with 3 columns: date (from 1/1/2018 up until 8/23/2019), column A and column B.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0,10,size=(600, 2)), columns=list('AB'))
df['date'] = pd.DataFrame(pd.date_range(start='1/1/2018', end='8/23/2019'))
df.set_index('date')
df is as follows:
date        A  B
2018-01-01  7  4
2018-01-02  5  4
2018-01-03  3  1
2018-01-04  9  3
2018-01-05  7  8
2018-01-06  0  0
2018-01-07  6  8
2018-01-08  3  7
...        .. ..
2019-08-18  1  0
2019-08-19  8  1
2019-08-20  5  9
2019-08-21  0  7
2019-08-22  3  6
2019-08-23  8  6
I want monthly accumulated values of column A and monthly averaged values of column B. The final output will be a df with 20 rows (12 months of 2018 and 8 months of 2019) and 4 columns: monthly accumulated values of A, monthly averaged values of B, month number and year number, just like below:
    month  year  monthly_accumulated_of_A  monthly_averaged_of_B
0       1  2018                       176               1.747947
1       2  2018                       110               2.399476
2       3  2018                       131               3.976747
3       4  2018                       227               2.314923
4       5  2018                       234               0.464097
5       6  2018                       249               1.662753
6       7  2018                       121               1.588865
7       8  2018                       165               2.318268
8       9  2018                       219               1.060595
9      10  2018                       131               0.577268
10     11  2018                       179               3.948414
11     12  2018                       115               1.750346
12      1  2019                       190               3.364003
13      2  2019                       215               0.864792
14      3  2019                       231               3.219739
15      4  2019                       186               2.904413
16      5  2019                       232               0.324695
17      6  2019                       163               1.334139
18      7  2019                       238               1.670644
19      8  2019                       112               1.316442
How can I achieve this in pandas?
Use DataFrameGroupBy.agg with DatetimeIndex.month and DatetimeIndex.year; for ordering add sort_index, and last use reset_index to get columns from the MultiIndex:
import pandas as pd
import numpy as np

np.random.seed(2018)

#changed 300 to 600
df = pd.DataFrame(np.random.randint(0,10,size=(600, 2)), columns=list('AB'))
df['date'] = pd.DataFrame(pd.date_range(start='1/1/2018', end='8/23/2019'))
df = df.set_index('date')

df1 = (df.groupby([df.index.month.rename('month'),
                   df.index.year.rename('year')])
         .agg({'A':'sum', 'B':'mean'})
         .sort_index(level=['year', 'month'])
         .reset_index())
print (df1)
    month  year    A         B
0       1  2018  147  4.838710
1       2  2018  120  3.678571
2       3  2018  114  4.387097
3       4  2018  143  3.800000
4       5  2018  124  3.870968
5       6  2018  129  4.700000
6       7  2018  143  3.935484
7       8  2018  118  5.483871
8       9  2018  150  5.500000
9      10  2018  139  4.225806
10     11  2018  136  4.933333
11     12  2018  141  4.548387
12      1  2019  137  4.709677
13      2  2019  120  4.964286
14      3  2019  167  4.935484
15      4  2019  121  4.200000
16      5  2019  133  4.129032
17      6  2019  140  5.066667
18      7  2019  189  4.677419
19      8  2019  100  3.695652
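Since the title also asks about Grouper: a similar monthly aggregation can be written with pd.Grouper on the DatetimeIndex. A minimal sketch under the same setup (df2 is an assumed name; this keys the result by month-end timestamps rather than separate month/year columns):
# group by calendar month directly on the DatetimeIndex
df2 = (df.groupby(pd.Grouper(freq='M'))
         .agg({'A': 'sum', 'B': 'mean'})
         .reset_index())
df2['month'] = df2['date'].dt.month
df2['year'] = df2['date'].dt.year
print(df2.head())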
Grouping data series by day intervals with Pandas
I have to perform some data analysis on a seasonal basis. I have circa one and a half years' worth of hourly measurements, from the end of 2015 to the second half of 2017, and I want to sort this data into seasons. Here's an example of the data I am working with:
Date,Year,Month,Day,Day week,Hour,Holiday,Week Day,Impulse,Power (kW),Temperature (C)
04/12/2015,2015,12,4,6,18,0,6,2968,1781,16.2
04/12/2015,2015,12,4,6,19,0,6,2437,1462,16.2
19/04/2016,2016,4,19,3,3,0,3,1348,809,14.4
19/04/2016,2016,4,19,3,4,0,3,1353,812,14.1
11/06/2016,2016,6,11,7,19,0,7,1395,837,18.8
11/06/2016,2016,6,11,7,20,0,7,1370,822,17.4
11/06/2016,2016,6,11,7,21,0,7,1364,818,17
11/06/2016,2016,6,11,7,22,0,7,1433,860,17.5
04/12/2016,2016,12,4,1,17,0,1,1425,855,14.6
04/12/2016,2016,12,4,1,18,0,1,1466,880,14.4
07/03/2017,2017,3,7,3,14,0,3,3668,2201,14.2
07/03/2017,2017,3,7,3,15,0,3,3666,2200,14
24/04/2017,2017,4,24,2,5,0,2,1347,808,11.4
24/04/2017,2017,4,24,2,6,0,2,1816,1090,11.5
24/04/2017,2017,4,24,2,7,0,2,2918,1751,12.4
15/06/2017,2017,6,15,5,13,1,1,2590,1554,22.5
15/06/2017,2017,6,15,5,14,1,1,2629,1577,22.5
15/06/2017,2017,6,15,5,15,1,1,2656,1594,22.1
15/11/2017,2017,11,15,4,13,0,4,3765,2259,15.6
15/11/2017,2017,11,15,4,14,0,4,3873,2324,15.9
15/11/2017,2017,11,15,4,15,0,4,3905,2343,15.8
15/11/2017,2017,11,15,4,16,0,4,3861,2317,15.3
As you can see, I have data from three different years. What I was thinking of doing is converting the first column with pd.to_datetime(), and then grouping the rows according to day/month intervals regardless of the year (if winter goes from 21/12 to 21/03, create a new dataframe with all of the rows whose date falls in this interval, regardless of the year), but I couldn't manage to ignore the year, which makes things more complicated.
EDIT: A desired output would be:
df_spring
Date,Year,Month,Day,Day week,Hour,Holiday,Week Day,Impulse,Power (kW),Temperature (C)
19/04/2016,2016,4,19,3,3,0,3,1348,809,14.4
19/04/2016,2016,4,19,3,4,0,3,1353,812,14.1
07/03/2017,2017,3,7,3,14,0,3,3668,2201,14.2
07/03/2017,2017,3,7,3,15,0,3,3666,2200,14
24/04/2017,2017,4,24,2,5,0,2,1347,808,11.4
24/04/2017,2017,4,24,2,6,0,2,1816,1090,11.5
24/04/2017,2017,4,24,2,7,0,2,2918,1751,12.4
df_autumn
Date,Year,Month,Day,Day week,Hour,Holiday,Week Day,Impulse,Power (kW),Temperature (C)
04/12/2015,2015,12,4,6,18,0,6,2968,1781,16.2
04/12/2015,2015,12,4,6,19,0,6,2437,1462,16.2
04/12/2016,2016,12,4,1,17,0,1,1425,855,14.6
04/12/2016,2016,12,4,1,18,0,1,1466,880,14.4
15/11/2017,2017,11,15,4,13,0,4,3765,2259,15.6
15/11/2017,2017,11,15,4,14,0,4,3873,2324,15.9
15/11/2017,2017,11,15,4,15,0,4,3905,2343,15.8
15/11/2017,2017,11,15,4,16,0,4,3861,2317,15.3
And so on for the remaining seasons.
Define each season by filtering the relevant rows using the Day and Month columns, as shown here for winter:
df_winter = df.loc[((df['Day'] >= 21) & (df['Month'] == 12)) |
                   (df['Month'] == 1) |
                   (df['Month'] == 2) |
                   ((df['Day'] <= 21) & (df['Month'] == 3))]
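The same pattern extends to the other three seasons. To avoid repeating the masks four times, one option is to compare (Month, Day) tuples against the boundary dates in a single pass; a hedged sketch, with the 21st boundaries taken from the question (the exact cut-off days are a choice):
import numpy as np

# (month, day) tuples compare lexicographically, so each seasonal
# boundary is a single comparison, independent of the year
md = list(zip(df['Month'], df['Day']))
conds = [
    [t >= (12, 21) or t <= (3, 21) for t in md],  # winter
    [(3, 22) <= t <= (6, 21) for t in md],        # spring
    [(6, 22) <= t <= (9, 21) for t in md],        # summer
]
df['season'] = np.select(conds, ['winter', 'spring', 'summer'], default='autumn')
df_winter = df[df['season'] == 'winter']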
You can simply filter your dataframe with month.isin():
# spring
df[df['Month'].isin([3,4])]
          Date  Year  Month  Day  Day week  Hour  Holiday  Week Day  Impulse  Power (kW)  Temperature (C)
2   19/04/2016  2016      4   19         3     3        0         3     1348         809             14.4
3   19/04/2016  2016      4   19         3     4        0         3     1353         812             14.1
10  07/03/2017  2017      3    7         3    14        0         3     3668        2201             14.2
11  07/03/2017  2017      3    7         3    15        0         3     3666        2200             14.0
12  24/04/2017  2017      4   24         2     5        0         2     1347         808             11.4
13  24/04/2017  2017      4   24         2     6        0         2     1816        1090             11.5
14  24/04/2017  2017      4   24         2     7        0         2     2918        1751             12.4
# autumn
df[df['Month'].isin([11,12])]
          Date  Year  Month  Day  Day week  Hour  Holiday  Week Day  Impulse  Power (kW)  Temperature (C)
0   04/12/2015  2015     12    4         6    18        0         6     2968        1781             16.2
1   04/12/2015  2015     12    4         6    19        0         6     2437        1462             16.2
8   04/12/2016  2016     12    4         1    17        0         1     1425         855             14.6
9   04/12/2016  2016     12    4         1    18        0         1     1466         880             14.4
18  15/11/2017  2017     11   15         4    13        0         4     3765        2259             15.6
19  15/11/2017  2017     11   15         4    14        0         4     3873        2324             15.9
20  15/11/2017  2017     11   15         4    15        0         4     3905        2343             15.8
21  15/11/2017  2017     11   15         4    16        0         4     3861        2317             15.3