Pandas Resampling biweekly between a specific week range - python

I have an 8-year timeseries with daily values where I would like to resample biweekly. However, I only need biweekly values from week 18 to week 30 of each year (i.e. W18, W20, W22, ..., W30). This method would sometimes give me the 'odd' biweekly values (i.e. W19, W21, W23,..., W29). How might I ensure that I would always get the 'even' biweekly values?
df = df.resample("2W").mean()
df["Week"] = df.index.map(lambda dt: dt.week)
df = df.loc[df.Week.isin(range(18,31))]
An example of the daily data from 2010-01-01 to 2018-12-31: (short version)
Date value_1 value_2
... ... ...
2010-05-03 10 1
2010-05-04 79 66
2010-05-05 40 16
2010-05-06 13 76
2010-05-07 2 36
2010-05-08 31 98
2010-05-09 96 3
2010-05-10 66 18
2010-05-11 99 9
... ... ...
Expected biweekly data between week 18 and week 30:
Date value_1 value_2 Week
2010-05-03 14 1 18
2010-05-17 33 89 20
2010-05-31 21 31 22
2010-06-14 33 56 24
2010-06-28 12 43 26
2010-07-12 21 72 28
2010-07-26 76 13 30
2011-05-02 60 28 18
2011-05-16 82 2 20
2011-05-30 30 15 22
... ... ... ...

I think that the best way is to create the range separately with list comprehension. The code below will give a range between 18 and 30 with only even values:
weeks_to_include = [i for i in range(18, 31) if i % 2 == 0]
With this range you can filter as you have above. I tested the code below and it worked for me:
import pandas as pd
#create a dummy dataframe
dr = pd.date_range(start='2013-01-01', end='2021-12-31', freq='D')
df = pd.DataFrame(index=dr)
df['col1'] = range(0, len(df))
#create a list of even weeks in a range
weeks_to_include = [i for i in range(18, 31) if i % 2 == 0]
#create a column with the week of the year
df['weekofyear'] = df.index.isocalendar().week
#filter for only weeks_to_include
df = df.loc[df['weekofyear'].isin(weeks_to_include)]
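The filter above keeps the daily rows that fall in the even weeks, but it does not aggregate them. If one averaged value per even week is what is wanted, a minimal sketch could group by ISO year and ISO week first and then keep the even weeks. The column name value_1, the dummy data, and the names iso_year and iso_week are assumptions for illustration, and this averages one week at a time rather than a two-week window:

```python
import numpy as np
import pandas as pd

# Dummy daily data covering the period mentioned in the question
dr = pd.date_range("2010-01-01", "2018-12-31", freq="D")
df = pd.DataFrame({"value_1": np.arange(len(dr))}, index=dr)

# ISO calendar components of the index (columns: year, week, day)
iso = df.index.isocalendar()
keys = [
    iso["year"].astype(int).rename("iso_year"),
    iso["week"].astype(int).rename("iso_week"),
]

# One mean per (ISO year, ISO week), independent of where resample bins start
weekly = df.groupby(keys).mean()

# Keep only the even weeks from 18 to 30
even_weeks = list(range(18, 31, 2))
mask = weekly.index.get_level_values("iso_week").isin(even_weeks)
result = weekly[mask]
print(result.head(7))
```

Because the grouping key is the week number itself, the result cannot drift onto the odd weeks the way a "2W" resample can when the first bin of the series starts on an odd week.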

Related

How to convert time string into hourly data?

I have a pandas dataframe of energy demand vs. time:
0 1
0 20201231T23-07 39815
1 20201231T22-07 41387
2 20201231T21-07 42798
3 20201231T20-07 44407
4 20201231T19-07 45612
5 20201231T18-07 44920
6 20201231T17-07 42617
7 20201231T16-07 41454
8 20201231T15-07 41371
9 20201231T14-07 41793
10 20201231T13-07 42298
11 20201231T12-07 42740
12 20201231T11-07 43185
13 20201231T10-07 42999
14 20201231T09-07 42373
15 20201231T08-07 41273
16 20201231T07-07 38909
17 20201231T06-07 37099
18 20201231T05-07 36022
19 20201231T04-07 35880
20 20201231T03-07 36305
21 20201231T02-07 36988
22 20201231T01-07 38166
23 20201231T00-07 40167
24 20201230T23-07 42624
25 20201230T22-07 44777
26 20201230T21-07 46205
27 20201230T20-07 47324
28 20201230T19-07 48011
29 20201230T18-07 46995
30 20201230T17-07 44902
31 20201230T16-07 44134
32 20201230T15-07 44228
33 20201230T14-07 44813
34 20201230T13-07 45187
35 20201230T12-07 45622
36 20201230T11-07 45831
37 20201230T10-07 45832
38 20201230T09-07 45476
39 20201230T08-07 44145
40 20201230T07-07 41650
I need to convert the time column into hourly data. I know that Python has some tools that can convert dates directly, is there one I could use here or will I need to do it manually?
Well just to obtain a time string you could use str.replace:
df["time"] = df["0"].str.replace(r'^\d{8}T(\d{2})-(\d{2})$', r'\1:\2', regex=True)
Assuming the time column is currently a string you could convert it to a datetime using pd.to_datetime and then extract the hour.
If you want to calculate, say, the average demand for each hour you could then use groupby.
df['time'] = pd.to_datetime(df['time'], format="%Y%m%dT%H-%M").dt.hour
df_demand_by_hour = df.groupby('time').mean()
print(df_demand_by_hour)
demand
time
0 40167.0
1 38166.0
2 36988.0
3 36305.0
4 35880.0
5 36022.0
6 37099.0
7 40279.5
8 42709.0
9 43924.5
10 44415.5
11 44508.0
12 44181.0
13 43742.5
14 43303.0
15 42799.5
16 42794.0
17 43759.5
18 45957.5
19 46811.5
20 45865.5
21 44501.5
22 43082.0
23 41219.5
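As a self-contained sketch of the parse-then-group approach, the snippet below uses three rows taken from the question's data. The column names time and demand are assumptions (the question's frame uses 0 and 1), and the trailing "-07" is simply dropped before parsing because its meaning is not stated:

```python
import pandas as pd

# Three rows from the question's data, with assumed column names
df = pd.DataFrame({
    "time": ["20201231T23-07", "20201231T22-07", "20201230T23-07"],
    "demand": [39815, 41387, 42624],
})

# Keep only the "YYYYMMDDTHH" part (first 11 characters) and parse it
parsed = pd.to_datetime(df["time"].str.slice(0, 11), format="%Y%m%dT%H")
df["hour"] = parsed.dt.hour

# Average demand for each hour of the day
demand_by_hour = df.groupby("hour")["demand"].mean()
print(demand_by_hour)
```

With these rows, hour 23 averages to 41219.5, which agrees with the last row of the output above.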
I don't know exactly what the -07 means, but you can drop it and turn the rest of the string into a datetime by doing:
import pandas as pd
# drop the trailing "-07" and parse the remaining "YYYYMMDDTHH" part
df['0'] = pd.to_datetime(df['0'].str.split('-').str[0], format='%Y%m%dT%H').dt.strftime('%H:%M:%S')
df
0 1
0 23:00:00 39815
1 22:00:00 41387
2 21:00:00 42798
3 20:00:00 44407
4 19:00:00 45612
...

Get a week startdate from week number for entire dateframe in python

I am looking for the week start date for an entire dataframe, in the format dd-mm-yyyy.
Below are the week numbers (src_data['WEEK']):
28
29
30
31
32
33
34
35
code :
src_data['firstdayofweek'] = datetime.datetime.strptime(f'{2020}-W{int(src_data['WEEK'] )- 1}-1','%Y-W%W-%w').date()
Output :
Thanks in advance
You can add a year and weekday as strings and parse the result with pd.to_datetime using the appropriate ISO directives (%G for the ISO year, %V for the ISO week, %u for the ISO weekday). If desired, convert to string with strftime:
src_data = pd.DataFrame({'WEEK':[28,29,30,31,32,33,34,35]})
year, weekday = '2020', '1'
src_data['DATE'] = pd.to_datetime(year + src_data['WEEK'].astype(str) + weekday,
format='%G%V%u').dt.strftime('%d-%m-%Y')
# src_data
# WEEK DATE
# 0 28 06-07-2020
# 1 29 13-07-2020
# 2 30 20-07-2020
# 3 31 27-07-2020
# 4 32 03-08-2020
# 5 33 10-08-2020
# 6 34 17-08-2020
# 7 35 24-08-2020
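An alternative sketch, assuming Python 3.8 or later and that the week numbers are ISO weeks, uses datetime.date.fromisocalendar from the standard library. The column name firstdayofweek is taken from the question; the fixed year 2020 is an assumption carried over from it:

```python
import datetime

import pandas as pd

src_data = pd.DataFrame({"WEEK": [28, 29, 30]})
year = 2020  # assumed: all week numbers belong to this ISO year

# Monday (ISO weekday 1) of each ISO week, formatted as dd-mm-yyyy
src_data["firstdayofweek"] = [
    datetime.date.fromisocalendar(year, int(week), 1).strftime("%d-%m-%Y")
    for week in src_data["WEEK"]
]
print(src_data)
```

This avoids building a string to parse, so single-digit week numbers need no zero padding.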

extract number of days in month from date column, return days to date in current month

I have a pandas dataframe with a date column
I'm trying to create a function and apply it to the dataframe to create a column that returns the number of days in the month/year specified
so far i have:
from calendar import monthrange
def dom(x):
    m = dfs["load_date"].dt.month
    y = dfs["load_date"].dt.year
    monthrange(y,m)
    days = monthrange[1]
    return days
This however does not work when I attempt to apply it to the date column.
Additionally, I would like to be able to identify whether or not it is the current month, and if so return the number of days up to the current date in that month as opposed to days in the entire month.
I am not sure of the best way to do this, all I can think of is to check the month/year against datetime's today and then use a delta
thanks in advance
For pt.1 of your question, you can cast to pd.Period and retrieve days_in_month:
import pandas as pd
# create a sample df:
df = pd.DataFrame({'date': pd.date_range('2020-01', '2021-01', freq='M')})
df['daysinmonths'] = df['date'].apply(lambda t: pd.Period(t, freq='S').days_in_month)
# df['daysinmonths']
# 0 31
# 1 29
# 2 31
# ...
For pt.2, you can take the timestamp of 'now' and create a boolean mask for your date column, i.e. where its year/month is less than "now". Then calculate the cumsum of the daysinmonth column for the section where the mask returns True. Invert the order of that series to get the days until now.
now = pd.Timestamp('now')
m = (df['date'].dt.year <= now.year) & (df['date'].dt.month < now.month)
df['daysuntilnow'] = df['daysinmonths'][m].cumsum().iloc[::-1].reset_index(drop=True)
Update after comment: to get the elapsed days per month, you can do
df['dayselapsed'] = df['daysinmonths']
m = (df['date'].dt.year == now.year) & (df['date'].dt.month == now.month)
if m.any():
    df.loc[m, 'dayselapsed'] = now.day
df.loc[(df['date'].dt.year >= now.year) & (df['date'].dt.month > now.month), 'dayselapsed'] = 0
output
df
Out[13]:
date daysinmonths daysuntilnow dayselapsed
0 2020-01-31 31 213.0 31
1 2020-02-29 29 182.0 29
2 2020-03-31 31 152.0 31
3 2020-04-30 30 121.0 30
4 2020-05-31 31 91.0 31
5 2020-06-30 30 60.0 30
6 2020-07-31 31 31.0 31
7 2020-08-31 31 NaN 27
8 2020-09-30 30 NaN 0
9 2020-10-31 31 NaN 0
10 2020-11-30 30 NaN 0
11 2020-12-31 31 NaN 0
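The month comparisons above look at year and month separately, which can misclassify dates when the data spans several years. A minimal sketch that compares whole monthly periods instead is shown below. The helper name days_elapsed is hypothetical, and "now" is fixed to 2020-08-27 only so that the result is reproducible:

```python
import pandas as pd

def days_elapsed(dates: pd.Series, now: pd.Timestamp) -> pd.Series:
    """Days in each date's month; days up to `now` for the current month; 0 for future months."""
    month = dates.dt.to_period("M")    # year-month of each date
    current = now.to_period("M")       # year-month of "now"
    elapsed = dates.dt.days_in_month.copy()
    elapsed[month == current] = now.day
    elapsed[month > current] = 0
    return elapsed

dates = pd.Series(pd.to_datetime(["2020-07-31", "2020-08-31", "2020-09-30"]))
result = days_elapsed(dates, pd.Timestamp("2020-08-27"))
print(result.tolist())
```

In practice `now` would be `pd.Timestamp("now")`. The accessor `dt.days_in_month` gives the same values as casting each timestamp to a pd.Period.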

filtering date column in python

I'm new to python and I'm facing the following problem. I have a dataframe composed of 2 columns, one of them is date (datetime64[ns]). I want to keep all records within the last 12 months. My code is the following:
today=start_time.date()
last_year = today + relativedelta(months = -12)
new_df = df[pd.to_datetime(df.mydate) >= last_year]
when I run it I get the following message:
TypeError: type object 2017-06-05
Any ideas?
last_year seems to bring me the date that I want in the following format: 2017-06-05
Create a time delta object in pandas to increment the date (12 months). Call pandas.Timestamp('now') to get the current date, and then create a date_range. Here is an example for getting monthly data for 12 months.
import pandas as pd
import datetime
list_1 = [i for i in range(0, 12)]
list_2 = [i for i in range(13, 25)]
list_3 = [i for i in range(26, 38)]
data_frame = pd.DataFrame({'A': list_1, 'B': list_2, 'C': list_3}, pd.date_range(pd.Timestamp('now'), pd.Timestamp('now') + pd.Timedelta(weeks=53), freq='M'))
We create a timestamp for the current date and enter that as our start date. Then we create a timedelta to increment that date by 53 weeks (or 52 if you'd like) which gets us 12 months of data. Below is the output:
A B C
2018-06-30 05:05:21.335625 0 13 26
2018-07-31 05:05:21.335625 1 14 27
2018-08-31 05:05:21.335625 2 15 28
2018-09-30 05:05:21.335625 3 16 29
2018-10-31 05:05:21.335625 4 17 30
2018-11-30 05:05:21.335625 5 18 31
2018-12-31 05:05:21.335625 6 19 32
2019-01-31 05:05:21.335625 7 20 33
2019-02-28 05:05:21.335625 8 21 34
2019-03-31 05:05:21.335625 9 22 35
2019-04-30 05:05:21.335625 10 23 36
2019-05-31 05:05:21.335625 11 24 37
Try
today = datetime.datetime.now()
You can use pandas functionality with datetime objects. The syntax is often more intuitive and obviates the need for additional imports.
last_year = pd.to_datetime('today') + pd.DateOffset(years=-1)
new_df = df[pd.to_datetime(df.mydate) >= last_year]
That said, we would need to see all your code to be sure of the reason behind your error; for example, how is start_time defined?
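A self-contained sketch of the pandas-only filter follows. The sample rows and the value column are made up for illustration, and the reference date is fixed to 2018-06-05 (an assumption chosen so that the cutoff is the 2017-06-05 mentioned in the question):

```python
import pandas as pd

df = pd.DataFrame({
    "mydate": pd.to_datetime(["2017-01-15", "2017-06-05", "2018-03-01"]),
    "value": [1, 2, 3],
})

# Fixed reference date for reproducibility; use pd.Timestamp("today").normalize() in practice
today = pd.Timestamp("2018-06-05")
cutoff = today - pd.DateOffset(months=12)

# Both sides of the comparison are pandas timestamps, so no type mismatch arises
new_df = df[df["mydate"] >= cutoff]
print(new_df)
```

Keeping the cutoff as a pandas Timestamp rather than a datetime.date is the design choice here, since the column is datetime64[ns].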

Getting the average of a certain hour on weekdays over several years in a pandas dataframe

I have an hourly dataframe in the following format over several years:
Date/Time Value
01.03.2010 00:00:00 60
01.03.2010 01:00:00 50
01.03.2010 02:00:00 52
01.03.2010 03:00:00 49
.
.
.
31.12.2013 23:00:00 77
I would like to average the data so I can get the average of hour 0, hour 1... hour 23 of each of the years.
So the output should look somehow like this:
Year Hour Avg
2010 00 63
2010 01 55
2010 02 50
.
.
.
2013 22 71
2013 23 80
Does anyone know how to obtain this in pandas?
Note: Now that Series have the dt accessor it's less important that date is the index, though Date/Time still needs to be a datetime64.
Update: You can do the groupby more directly (without the lambda):
In [21]: df.groupby([df["Date/Time"].dt.year, df["Date/Time"].dt.hour]).mean()
Out[21]:
Value
Date/Time Date/Time
2010 0 60
1 50
2 52
3 49
In [22]: res = df.groupby([df["Date/Time"].dt.year, df["Date/Time"].dt.hour]).mean()
In [23]: res.index.names = ["year", "hour"]
In [24]: res
Out[24]:
Value
year hour
2010 0 60
1 50
2 52
3 49
If it's a datetime64 index you can do:
In [31]: df1.groupby([df1.index.year, df1.index.hour]).mean()
Out[31]:
Value
2010 0 60
1 50
2 52
3 49
Old answer (will be slower):
Assuming Date/Time was the index* you can use a mapping function in the groupby:
In [11]: year_hour_means = df1.groupby(lambda x: (x.year, x.hour)).mean()
In [12]: year_hour_means
Out[12]:
Value
(2010, 0) 60
(2010, 1) 50
(2010, 2) 52
(2010, 3) 49
For a more useful index, you could then create a MultiIndex from the tuples:
In [13]: year_hour_means.index = pd.MultiIndex.from_tuples(year_hour_means.index,
names=['year', 'hour'])
In [14]: year_hour_means
Out[14]:
Value
year hour
2010 0 60
1 50
2 52
3 49
* if not, then first use set_index:
df1 = df.set_index('Date/Time')
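If your date/time column is in the datetime format (see dateutil.parser for automatic parsing options) and is set as the index, you can use pandas resample as below. The how argument has been removed from resample, so call the aggregation method on the result instead:
year_hour_means = df.resample('H').mean()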
which will keep your data in the datetime format. This may help you with whatever you are going to be doing with your data down the line.
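Putting the parsing and the grouping together, a minimal sketch that produces the Year / Hour / Avg layout asked for might look like this. The sample rows are made up (two readings at hour 0 in 2010 so that an average is actually computed), and the day-first format string is an assumption based on the dates shown in the question:

```python
import pandas as pd

df = pd.DataFrame({
    "Date/Time": [
        "01.03.2010 00:00:00",
        "02.03.2010 00:00:00",
        "01.03.2010 01:00:00",
        "31.12.2013 23:00:00",
    ],
    "Value": [60, 66, 50, 77],
})

# Parse day-first strings into datetime64 values
df["Date/Time"] = pd.to_datetime(df["Date/Time"], format="%d.%m.%Y %H:%M:%S")

# Group by calendar year and hour of day, then flatten to ordinary columns
res = (
    df.groupby([
        df["Date/Time"].dt.year.rename("Year"),
        df["Date/Time"].dt.hour.rename("Hour"),
    ])["Value"]
    .mean()
    .rename("Avg")
    .reset_index()
)
print(res)
```

Renaming the two grouping keys before the groupby avoids the duplicate "Date/Time" index names seen in the output above.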
