Pandas mapping with multiple conditions - python

I have a dataframe consisting of Year, Month, Temperature. Now, I need to create seasonal means, such as DJF (Dec, Jan, Feb), MAM (Mar, Apr, May), JJA (Jun, Jul, Aug), SON (Sep, Oct, Nov).
But how can I take into account the fact that a DJF mean should combine December of one year with January and February of the following year?
This is the code I have so far:
z = {1: 'DJF', 2: 'DJF', 3: 'MAM', 4: 'MAM', 5: 'MAM', 6: 'JJA', 7: 'JJA', 8: 'JJA',
     9: 'SON', 10: 'SON', 11: 'SON', 12: 'DJF'}
df['season'] = df['Month'].map(z)
The problem with the above code is that when I group by year and season to calculate the means, the DJF values are incorrect, since they take Dec, Jan, and Feb of the same year:
df.groupby(['Year','season']).mean()

I think you can create a PeriodIndex with to_datetime and to_period, then shift by one month and convert to quarters with asfreq. Last, group by the index and aggregate the mean:
df['Day'] = 1
df.index = pd.to_datetime(df[['Year','Month','Day']]).dt.to_period('M')
df = df.shift(1, freq='M').asfreq('Q')
print (df.groupby(level=0)['Temperature'].mean())
Sample:
rng = pd.date_range('2017-04-03', periods=20, freq='M')
df = pd.DataFrame({'Date': rng, 'Temperature': range(20)})
df['Year'] = df.Date.dt.year
df['Month'] = df.Date.dt.month
df = df.drop('Date', axis=1)
print (df)
Temperature Year Month
0 0 2017 4
1 1 2017 5
2 2 2017 6
3 3 2017 7
4 4 2017 8
5 5 2017 9
6 6 2017 10
7 7 2017 11
8 8 2017 12
9 9 2018 1
10 10 2018 2
11 11 2018 3
12 12 2018 4
13 13 2018 5
14 14 2018 6
15 15 2018 7
16 16 2018 8
17 17 2018 9
18 18 2018 10
19 19 2018 11
df['Day'] = 1
df.index = pd.to_datetime(df[['Year','Month','Day']]).dt.to_period('M')
df = df.shift(1, freq='M').asfreq('Q')
print (df)
Temperature Year Month Day
2017Q2 0 2017 4 1
2017Q2 1 2017 5 1
2017Q3 2 2017 6 1
2017Q3 3 2017 7 1
2017Q3 4 2017 8 1
2017Q4 5 2017 9 1
2017Q4 6 2017 10 1
2017Q4 7 2017 11 1
2018Q1 8 2017 12 1
2018Q1 9 2018 1 1
2018Q1 10 2018 2 1
2018Q2 11 2018 3 1
2018Q2 12 2018 4 1
2018Q2 13 2018 5 1
2018Q3 14 2018 6 1
2018Q3 15 2018 7 1
2018Q3 16 2018 8 1
2018Q4 17 2018 9 1
2018Q4 18 2018 10 1
2018Q4 19 2018 11 1
print (df.groupby(level=0)['Temperature'].mean())
2017Q2 0.5
2017Q3 3.0
2017Q4 6.0
2018Q1 9.0
2018Q2 12.0
2018Q3 15.0
2018Q4 18.0
Freq: Q-DEC, Name: Temperature, dtype: float64
And last, if you need a season column:
df1 = df.groupby(level=0)['Temperature'].mean().rename_axis('per').reset_index()
z = {1: 'DJF',2: 'MAM', 3: 'JJA', 4: 'SON'}
df1['season'] = df1['per'].dt.quarter.map(z)
df1['year'] = df1['per'].dt.year
print (df1)
per Temperature season year
0 2017Q2 0.5 MAM 2017
1 2017Q3 3.0 JJA 2017
2 2017Q4 6.0 SON 2017
3 2018Q1 9.0 DJF 2018
4 2018Q2 12.0 MAM 2018
5 2018Q3 15.0 JJA 2018
6 2018Q4 18.0 SON 2018
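For completeness, an anchored quarterly frequency should give the same grouping without the one-month shift: with freq 'Q-NOV' the quarterly year ends in November, so December-February fall into a single Q1 period. A minimal sketch, assuming the same Year/Month/Temperature columns as in the sample:
df['Day'] = 1
# Q-NOV quarters: Q1 = DJF, Q2 = MAM, Q3 = JJA, Q4 = SON, matching the z map above
per = pd.to_datetime(df[['Year', 'Month', 'Day']]).dt.to_period('Q-NOV')
print (df.groupby(per)['Temperature'].mean())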

Related

Convert a Python data frame with different 'year' columns into a continuous time series

Is it possible to convert a Pandas dataframe like this into a time series where each year follows the last one?
This is likely what df.unstack() is meant for.
import numpy as np
import pandas as pd

np.random.seed(111)  # reproducibility
df = pd.DataFrame(
    data={
        "2009": np.random.randn(12),
        "2010": np.random.randn(12),
        "2011": np.random.randn(12),
    },
    index=range(1, 13)
)
print(df)
2009 2010 2011
1 -1.133838 -1.440585 0.570594
2 0.384319 0.773703 0.915420
3 1.496554 -1.027967 -1.669341
4 -0.355382 -0.090986 0.482714
5 -0.787534 0.492003 -0.310473
6 -0.459439 0.424672 2.394690
7 -0.059169 1.283049 1.550931
8 -0.354174 0.315986 -0.646465
9 -0.735523 -0.408082 -0.928937
10 -1.183940 -0.067948 -1.654976
11 0.238894 -0.952427 0.350193
12 -0.589920 -0.110677 -0.141757
df_out = df.unstack().reset_index()
df_out.columns = ["year", "month", "value"]
print(df_out)
year month value
0 2009 1 -1.133838
1 2009 2 0.384319
2 2009 3 1.496554
3 2009 4 -0.355382
4 2009 5 -0.787534
5 2009 6 -0.459439
6 2009 7 -0.059169
7 2009 8 -0.354174
8 2009 9 -0.735523
9 2009 10 -1.183940
10 2009 11 0.238894
11 2009 12 -0.589920
12 2010 1 -1.440585
13 2010 2 0.773703
14 2010 3 -1.027967
15 2010 4 -0.090986
16 2010 5 0.492003
17 2010 6 0.424672
18 2010 7 1.283049
19 2010 8 0.315986
20 2010 9 -0.408082
21 2010 10 -0.067948
22 2010 11 -0.952427
23 2010 12 -0.110677
24 2011 1 0.570594
25 2011 2 0.915420
26 2011 3 -1.669341
27 2011 4 0.482714
28 2011 5 -0.310473
29 2011 6 2.394690
30 2011 7 1.550931
31 2011 8 -0.646465
32 2011 9 -0.928937
33 2011 10 -1.654976
34 2011 11 0.350193
35 2011 12 -0.141757
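If the end goal is a genuinely continuous series rather than year/month columns, the two columns can be turned into a monthly PeriodIndex; a short sketch continuing from df_out (the year labels are strings here, hence the astype; the tmp name is my own):
# Assemble a datetime from year/month plus a dummy day, then convert to monthly periods.
tmp = df_out.assign(year=df_out['year'].astype(int), day=1)
df_out.index = pd.to_datetime(tmp[['year', 'month', 'day']]).dt.to_period('M')
ts = df_out['value']  # one continuous monthly series, each year after the last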

How to calculate a cumulative sum in Python using pandas over all the columns except the first one, which contains names?

Here's the data in csv format:
Name 2012 2013 2014 2015 2016 2017 2018 2019 2020
Jack 1 15 25 3 5 11 5 8 3
Jill 5 10 32 5 5 14 6 8 7
I don't want the Name column to be included, as it gives an error.
I tried
df.cumsum()
Try set_index and reset_index to keep the Name column:
df.set_index('Name').cumsum().reset_index()
Output:
Name 2012 2013 2014 2015 2016 2017 2018 2019 2020
0 Jack 1 15 25 3 5 11 5 8 3
1 Jill 6 25 57 8 10 25 11 16 10
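An alternative that skips the index round-trip is to restrict cumsum to the numeric columns, for example with select_dtypes (a sketch; num_cols is my own name):
# Cumulative-sum only the numeric columns, leaving 'Name' untouched.
num_cols = df.select_dtypes('number').columns
df[num_cols] = df[num_cols].cumsum()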

Fill column with value from previous year from the same month

How can I use the value from the same month in the previous year to fill values in the following table for 2020:
Category Month Year Value
A 1 2019 15
B 2 2019 20
A 2 2019 90
A 3 2019 50
B 4 2019 40
A 5 2019 20
A 6 2019 15
A 7 2019 17
A 8 2019 18
A 9 2019 12
A 10 2019 11
A 11 2019 19
A 12 2019 15
A 1 2020 18
A 2 2020 53
A 3 2020 80
The final desired result is the following:
Category Month Year Value
A 1 2019 15
B 2 2019 20
A 2 2019 90
A 3 2019 50
B 4 2019 40
A 4 2019 40
A 5 2019 20
A 6 2019 15
A 7 2019 17
A 8 2019 18
A 9 2019 12
A 10 2019 11
A 11 2019 19
A 12 2019 15
A 1 2020 18
A 2 2020 53
A 3 2020 80
B 4 2020 40
A 4 2020 40
A 5 2020 20
A 6 2020 15
A 7 2020 17
A 8 2020 18
A 9 2020 12
A 10 2020 11
A 11 2020 19
A 12 2020 15
I tried using pandas groupby, but I am not sure if that is the right approach.
IIUC, use pivot_table, then ffill within each Category group, then stack:
s = (df.pivot_table(index=['Category', 'Year'], columns='Month', values='Value')
       .groupby(level=0)
       .ffill()
       .stack()
       .reset_index())
Category Year level_2 0
0 A 2019 1 15.0
1 A 2019 2 90.0
2 A 2019 3 50.0
3 A 2019 5 20.0
4 A 2019 6 15.0
5 A 2019 7 17.0
6 A 2019 8 18.0
7 A 2019 9 12.0
8 A 2019 10 11.0
9 A 2019 11 19.0
10 A 2019 12 15.0
11 A 2020 1 18.0
12 A 2020 2 53.0
13 A 2020 3 80.0
14 A 2020 5 20.0
15 A 2020 6 15.0
16 A 2020 7 17.0
17 A 2020 8 18.0
18 A 2020 9 12.0
19 A 2020 10 11.0
20 A 2020 11 19.0
21 A 2020 12 15.0
22 B 2019 2 20.0
23 B 2019 4 40.0
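If you want the original column names back on s (stack leaves the value column unnamed, hence the 0 above), a small follow-up sketch:
s.columns = ['Category', 'Year', 'Month', 'Value']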
You can accomplish this with a combination of loc, concat, and drop_duplicates.
The idea here is to concatenate the dataframe with a copy of the 2019 data where year is changed to 2020, and then only keeping the first value for Category, Month, Year.
df2 = df.loc[df['Year'] == 2019, :].copy()  # copy so the assignment below does not warn
df2['Year'] = 2020
pd.concat([df, df2]).drop_duplicates(subset=['Category', 'Month', 'Year'], keep='first')
Output
Category Month Year Value
0 A 1 2019 15
1 B 2 2019 20
2 A 2 2019 90
3 A 3 2019 50
4 B 4 2019 40
5 A 5 2019 20
6 A 6 2019 15
7 A 7 2019 17
8 A 8 2019 18
9 A 9 2019 12
10 A 10 2019 11
11 A 11 2019 19
12 A 12 2019 15
13 A 1 2020 18
14 A 2 2020 53
15 A 3 2020 80
1 B 2 2020 20
4 B 4 2020 40
5 A 5 2020 20
6 A 6 2020 15
7 A 7 2020 17
8 A 8 2020 18
9 A 9 2020 12
10 A 10 2020 11
11 A 11 2020 19
12 A 12 2020 15
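The same idea can be written without hardcoding the two years; a sketch, assuming each (Category, Month, Year) triple should be unique and only a one-year step forward is needed:
# Shift every year forward by one; on conflicts the original rows win,
# because keep='first' prefers the rows from df.
prev = df.copy()
prev['Year'] = prev['Year'] + 1
out = (pd.concat([df, prev])
         .drop_duplicates(subset=['Category', 'Month', 'Year'], keep='first')
         .sort_values(['Year', 'Month'])
         .reset_index(drop=True))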

How to use groupby and grouper properly for accumulating column 'A' and averaging column 'B', month by month

I have a pandas data with 3 columns:
date: from 1/1/2018 up until 8/23/2019, column A and column B.
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(0, 10, size=(600, 2)), columns=list('AB'))
df['date'] = pd.date_range(start='1/1/2018', end='8/23/2019')
df = df.set_index('date')
df is as follows:
date A B
2018-01-01 7 4
2018-01-02 5 4
2018-01-03 3 1
2018-01-04 9 3
2018-01-05 7 8
2018-01-06 0 0
2018-01-07 6 8
2018-01-08 3 7
...
...
...
2019-08-18 1 0
2019-08-19 8 1
2019-08-20 5 9
2019-08-21 0 7
2019-08-22 3 6
2019-08-23 8 6
I want monthly accumulated values of column A and monthly averaged values of column B. The final output will be a df with 20 rows (12 months of 2018 and 8 months of 2019) and 4 columns, holding the monthly accumulated values of column A, the monthly averaged values of column B, the month number, and the year number, just like below:
month year monthly_accumulated_of_A monthly_averaged_of_B
0 1 2018 176 1.747947
1 2 2018 110 2.399476
2 3 2018 131 3.976747
3 4 2018 227 2.314923
4 5 2018 234 0.464097
5 6 2018 249 1.662753
6 7 2018 121 1.588865
7 8 2018 165 2.318268
8 9 2018 219 1.060595
9 10 2018 131 0.577268
10 11 2018 179 3.948414
11 12 2018 115 1.750346
12 1 2019 190 3.364003
13 2 2019 215 0.864792
14 3 2019 231 3.219739
15 4 2019 186 2.904413
16 5 2019 232 0.324695
17 6 2019 163 1.334139
18 7 2019 238 1.670644
19 8 2019 112 1.316442
How can I achieve this in pandas?
Use DataFrameGroupBy.agg with DatetimeIndex.month and DatetimeIndex.year. For ordering, add sort_index, and last use reset_index to turn the MultiIndex into columns:
import pandas as pd
import numpy as np

np.random.seed(2018)
# changed 300 to 600
df = pd.DataFrame(np.random.randint(0, 10, size=(600, 2)), columns=list('AB'))
df['date'] = pd.date_range(start='1/1/2018', end='8/23/2019')
df = df.set_index('date')

df1 = (df.groupby([df.index.month.rename('month'),
                   df.index.year.rename('year')])
         .agg({'A': 'sum', 'B': 'mean'})
         .sort_index(level=['year', 'month'])
         .reset_index())
print (df1)
month year A B
0 1 2018 147 4.838710
1 2 2018 120 3.678571
2 3 2018 114 4.387097
3 4 2018 143 3.800000
4 5 2018 124 3.870968
5 6 2018 129 4.700000
6 7 2018 143 3.935484
7 8 2018 118 5.483871
8 9 2018 150 5.500000
9 10 2018 139 4.225806
10 11 2018 136 4.933333
11 12 2018 141 4.548387
12 1 2019 137 4.709677
13 2 2019 120 4.964286
14 3 2019 167 4.935484
15 4 2019 121 4.200000
16 5 2019 133 4.129032
17 6 2019 140 5.066667
18 7 2019 189 4.677419
19 8 2019 100 3.695652
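Since the question title mentions Grouper, an equivalent formulation groups on the DatetimeIndex with pd.Grouper(freq='M') and derives month/year afterwards. A sketch, assuming df is indexed by 'date' as above:
df2 = (df.groupby(pd.Grouper(freq='M'))
         .agg({'A': 'sum', 'B': 'mean'})
         .reset_index())
# group labels are month-end timestamps; split them into month/year columns
df2['month'] = df2['date'].dt.month
df2['year'] = df2['date'].dt.year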

Grouping data series by day intervals with Pandas

I have to perform some data analysis on a seasonal basis.
I have circa one and a half years' worth of hourly measurements, from the end of 2015 to the second half of 2017. What I want to do is sort this data into seasons.
Here's an example of the data I am working with:
Date,Year,Month,Day,Day week,Hour,Holiday,Week Day,Impulse,Power (kW),Temperature (C)
04/12/2015,2015,12,4,6,18,0,6,2968,1781,16.2
04/12/2015,2015,12,4,6,19,0,6,2437,1462,16.2
19/04/2016,2016,4,19,3,3,0,3,1348,809,14.4
19/04/2016,2016,4,19,3,4,0,3,1353,812,14.1
11/06/2016,2016,6,11,7,19,0,7,1395,837,18.8
11/06/2016,2016,6,11,7,20,0,7,1370,822,17.4
11/06/2016,2016,6,11,7,21,0,7,1364,818,17
11/06/2016,2016,6,11,7,22,0,7,1433,860,17.5
04/12/2016,2016,12,4,1,17,0,1,1425,855,14.6
04/12/2016,2016,12,4,1,18,0,1,1466,880,14.4
07/03/2017,2017,3,7,3,14,0,3,3668,2201,14.2
07/03/2017,2017,3,7,3,15,0,3,3666,2200,14
24/04/2017,2017,4,24,2,5,0,2,1347,808,11.4
24/04/2017,2017,4,24,2,6,0,2,1816,1090,11.5
24/04/2017,2017,4,24,2,7,0,2,2918,1751,12.4
15/06/2017,2017,6,15,5,13,1,1,2590,1554,22.5
15/06/2017,2017,6,15,5,14,1,1,2629,1577,22.5
15/06/2017,2017,6,15,5,15,1,1,2656,1594,22.1
15/11/2017,2017,11,15,4,13,0,4,3765,2259,15.6
15/11/2017,2017,11,15,4,14,0,4,3873,2324,15.9
15/11/2017,2017,11,15,4,15,0,4,3905,2343,15.8
15/11/2017,2017,11,15,4,16,0,4,3861,2317,15.3
As you can see I have data on three different years.
What I was thinking of doing is converting the first column with the pd.to_datetime() command, and then grouping the rows according to day/month in dd/mm intervals, regardless of the year (if winter goes from 21/12 to 21/03, create a new dataframe with all of the rows whose date falls in this interval, whatever the year). But I couldn't do it while ignoring the year, which makes things more complicated.
EDIT:
A desired output would be:
df_spring
Date,Year,Month,Day,Day week,Hour,Holiday,Week Day,Impulse,Power (kW),Temperature (C)
19/04/2016,2016,4,19,3,3,0,3,1348,809,14.4
19/04/2016,2016,4,19,3,4,0,3,1353,812,14.1
07/03/2017,2017,3,7,3,14,0,3,3668,2201,14.2
07/03/2017,2017,3,7,3,15,0,3,3666,2200,14
24/04/2017,2017,4,24,2,5,0,2,1347,808,11.4
24/04/2017,2017,4,24,2,6,0,2,1816,1090,11.5
24/04/2017,2017,4,24,2,7,0,2,2918,1751,12.4
df_autumn
Date,Year,Month,Day,Day week,Hour,Holiday,Week Day,Impulse,Power (kW),Temperature (C)
04/12/2015,2015,12,4,6,18,0,6,2968,1781,16.2
04/12/2015,2015,12,4,6,19,0,6,2437,1462,16.2
04/12/2016,2016,12,4,1,17,0,1,1425,855,14.6
04/12/2016,2016,12,4,1,18,0,1,1466,880,14.4
15/11/2017,2017,11,15,4,13,0,4,3765,2259,15.6
15/11/2017,2017,11,15,4,14,0,4,3873,2324,15.9
15/11/2017,2017,11,15,4,15,0,4,3905,2343,15.8
15/11/2017,2017,11,15,4,16,0,4,3861,2317,15.3
And so on for the remaining seasons.
Define each season by filtering the relevant rows using Day and Month columns as presented for winter:
df_winter = df.loc[((df['Day'] >= 21) & (df['Month'] == 12)) |
                   (df['Month'] == 1) |
                   (df['Month'] == 2) |
                   ((df['Day'] <= 21) & (df['Month'] == 3))]
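The same mask pattern can be written once for all four seasons; a sketch using the 21st-of-the-month boundaries described in the question (the masks dict and frame names are my own):
m, d = df['Month'], df['Day']
masks = {
    'winter': ((m == 12) & (d >= 21)) | (m == 1) | (m == 2) | ((m == 3) & (d <= 21)),
    'spring': ((m == 3) & (d > 21)) | (m == 4) | (m == 5) | ((m == 6) & (d <= 21)),
    'summer': ((m == 6) & (d > 21)) | (m == 7) | (m == 8) | ((m == 9) & (d <= 21)),
    'autumn': ((m == 9) & (d > 21)) | (m == 10) | (m == 11) | ((m == 12) & (d < 21)),
}
season_frames = {name: df[mask] for name, mask in masks.items()}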
You can simply filter your dataframe with Month.isin():
# spring
df[df['Month'].isin([3,4])]
Date Year Month Day Day week Hour Holiday Week Day Impulse Power (kW) Temperature (C)
2 19/04/2016 2016 4 19 3 3 0 3 1348 809 14.4
3 19/04/2016 2016 4 19 3 4 0 3 1353 812 14.1
10 07/03/2017 2017 3 7 3 14 0 3 3668 2201 14.2
11 07/03/2017 2017 3 7 3 15 0 3 3666 2200 14.0
12 24/04/2017 2017 4 24 2 5 0 2 1347 808 11.4
13 24/04/2017 2017 4 24 2 6 0 2 1816 1090 11.5
14 24/04/2017 2017 4 24 2 7 0 2 2918 1751 12.4
# autumn
df[df['Month'].isin([11,12])]
Date Year Month Day Day week Hour Holiday Week Day Impulse Power (kW) Temperature (C)
0 04/12/2015 2015 12 4 6 18 0 6 2968 1781 16.2
1 04/12/2015 2015 12 4 6 19 0 6 2437 1462 16.2
8 04/12/2016 2016 12 4 1 17 0 1 1425 855 14.6
9 04/12/2016 2016 12 4 1 18 0 1 1466 880 14.4
18 15/11/2017 2017 11 15 4 13 0 4 3765 2259 15.6
19 15/11/2017 2017 11 15 4 14 0 4 3873 2324 15.9
20 15/11/2017 2017 11 15 4 15 0 4 3905 2343 15.8
21 15/11/2017 2017 11 15 4 16 0 4 3861 2317 15.3
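If separate dataframes are not required, a single season label column lets one groupby cover all four seasons at once; a whole-month sketch that ignores the 21st boundaries (the season_map and the aggregated column are my own choices):
season_map = {12: 'winter', 1: 'winter', 2: 'winter',
              3: 'spring', 4: 'spring', 5: 'spring',
              6: 'summer', 7: 'summer', 8: 'summer',
              9: 'autumn', 10: 'autumn', 11: 'autumn'}
df['Season'] = df['Month'].map(season_map)
print(df.groupby('Season')['Power (kW)'].mean())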
