Pandas dataframe: omit weekends and days near holidays - python

I have a Pandas dataframe with a DataTimeIndex and some other columns, similar to this:
import pandas as pd
import numpy as np
range = pd.date_range('2017-12-01', '2018-01-05', freq='6H')
df = pd.DataFrame(index = range)
# Average speed in miles per hour
df['value'] = np.random.randint(low=0, high=60, size=len(df.index))
df.info()
# DatetimeIndex: 141 entries, 2017-12-01 00:00:00 to 2018-01-05 00:00:00
# Freq: 6H
# Data columns (total 1 columns):
# value 141 non-null int64
# dtypes: int64(1)
# memory usage: 2.2 KB
df.head(10)
# value
# 2017-12-01 00:00:00 15
# 2017-12-01 06:00:00 54
# 2017-12-01 12:00:00 19
# 2017-12-01 18:00:00 13
# 2017-12-02 00:00:00 35
# 2017-12-02 06:00:00 31
# 2017-12-02 12:00:00 58
# 2017-12-02 18:00:00 6
# 2017-12-03 00:00:00 8
# 2017-12-03 06:00:00 30
How can I select or filter the entries that are:
Weekdays only (that is, not weekend days Saturday or Sunday)
Not within N days of the dates in a list (e.g. U.S. holidays like '12-25' or '01-01')?
I was hoping for something like:
df = exclude_Sat_and_Sun(df)
omit_days = ['12-25', '01-01']
N = 3 # days near the holidays
df = exclude_days_near_omit_days(N, omit_days)
I was thinking of creating a new column to break out the month and day and then comparing them to the criteria for 1 and 2 above. However, I was hoping for something more Pythonic using the DateTimeIndex.
Thanks for any help.

The first part can be easily accomplished using the Pandas DatetimeIndex.dayofweek property, which starts counting weekdays with Monday as 0 and ending with Sunday as 6.
df[df.index.dayofweek < 5] will give you only the weekdays.
For the second part you can use the datetime module. Below I will give an example for only one date, namely 2017-12-25. You can easily generalize it to a list of dates, for example by defining a helper function.
from datetime import datetime, timedelta
N = 3
df[abs(df.index.date - datetime.strptime("2017-12-25", '%Y-%m-%d').date()) > timedelta(N)]
This will give all dates that are more than N=3 days away from 2017-12-25. That is, it will exclude an interval of 7 days from 2017-12-22 to 2017-12-28.
Lastly, you can combine the two criteria using the & operator, as you probably know.
df[
(df.index.dayofweek < 5)
&
(abs(df.index.date - datetime.strptime("2017-12-25", '%Y-%m-%d').date()) > timedelta(N))
]

I followed the answer by #Bahman Engheta and created a function to omit dates from a dataframe.
import pandas as pd
from datetime import datetime, timedelta
def omit_dates(df, list_years, list_dates, omit_days_near=3, omit_weekends=False):
'''
Given a Pandas dataframe with a DatetimeIndex, remove rows that have a date
near a given list of dates and/or a date on a weekend.
Parameters:
----------
df : Pandas dataframe
list_years : list of str
Contains a list of years in string form
list_dates : list of str
Contains a list of dates in string form encoded as MM-DD
omit_days_near : int
Threshold of days away from list_dates to remove. For example, if
omit_days_near=3, then omit all days that are 3 days away from
any date in list_dates.
omit_weekends : bool
If true, omit dates that are on weekends.
Returns:
-------
Pandas dataframe
New resulting dataframe with dates omitted.
'''
if not isinstance(df, pd.core.frame.DataFrame):
raise ValueError("df is expected to be a Pandas dataframe, not %s" % type(df).__name__)
if not isinstance(df.index, pd.tseries.index.DatetimeIndex):
raise ValueError("Dataframe is expected to have an index of DateTimeIndex, not %s" %
type(df.index).__name__)
if not isinstance(list_years, list):
list_years = [list_years]
if not isinstance(list_dates, list):
list_dates = [list_dates]
result = df.copy()
if omit_weekends:
result = result.loc[result.index.dayofweek < 5]
omit_dates = [ '%s-%s' % (year, date) for year in list_years for date in list_dates ]
for date in omit_dates:
result = result.loc[abs(result.index.date - datetime.strptime(date, '%Y-%m-%d').date()) > timedelta(omit_days_near)]
return result
Here is example usage. Suppose you have a dataframe that has a DateTimeIndex and other columns, like this:
import pandas as pd
import numpy as np
range = pd.date_range('2017-12-01', '2018-01-05', freq='1D')
df = pd.DataFrame(index = range)
df['value'] = np.random.randint(low=0, high=60, size=len(df.index))
The resulting dataframe looks like this:
value
2017-12-01 42
2017-12-02 35
2017-12-03 49
2017-12-04 25
2017-12-05 19
2017-12-06 28
2017-12-07 21
2017-12-08 57
2017-12-09 3
2017-12-10 57
2017-12-11 46
2017-12-12 20
2017-12-13 7
2017-12-14 5
2017-12-15 30
2017-12-16 57
2017-12-17 4
2017-12-18 46
2017-12-19 32
2017-12-20 48
2017-12-21 55
2017-12-22 52
2017-12-23 45
2017-12-24 34
2017-12-25 42
2017-12-26 33
2017-12-27 17
2017-12-28 2
2017-12-29 2
2017-12-30 51
2017-12-31 19
2018-01-01 6
2018-01-02 43
2018-01-03 11
2018-01-04 45
2018-01-05 45
Now, let's specify dates to remove. I want to remove the dates '12-10', '12-25', '12-31', and '01-01' (following MM-DD notation) and all dates within 2 days of those dates. Further, I want to remove those dates from both the years '2016' and '2017'. I also want to remove weekend dates.
I'll call my function like this:
years = ['2016', '2017']
holiday_dates = ['12-10', '12-25', '12-31', '01-01']
omit_dates(df, years, holiday_dates, omit_days_near=2, omit_weekends=True)
The result is:
value
2017-12-01 42
2017-12-04 25
2017-12-05 19
2017-12-06 28
2017-12-07 21
2017-12-13 7
2017-12-14 5
2017-12-15 30
2017-12-18 46
2017-12-19 32
2017-12-20 48
2017-12-21 55
2017-12-22 52
2017-12-28 2
2018-01-03 11
2018-01-04 45
2018-01-05 45
Is that answer correct? Here are the calendars for December 2017 and January 2018:
December 2017
Su Mo Tu We Th Fr Sa
1 2
3 4 5 6 7 8 9
10 11 12 13 14 15 16
17 18 19 20 21 22 23
24 25 26 27 28 29 30
31
January 2018
Su Mo Tu We Th Fr Sa
1 2 3 4 5 6
7 8 9 10 11 12 13
14 15 16 17 18 19 20
21 22 23 24 25 26 27
28 29 30 31
Looks like it works.

Related

Grouping of a dataframe monthly after calculating the highest daily values

I've got a dataframe with two columns one is datetime dataframe consisting of dates, and another one consists of quantity. It looks like something like this,
Date Quantity
0 2019-01-05 10
1 2019-01-10 15
2 2019-01-22 14
3 2019-02-03 12
4 2019-05-11 25
5 2019-05-21 4
6 2019-07-08 1
7 2019-07-30 15
8 2019-09-05 31
9 2019-09-10 44
10 2019-09-25 8
11 2019-12-09 10
12 2020-04-11 111
13 2020-04-17 5
14 2020-06-05 17
15 2020-06-16 12
16 2020-06-22 14
I want to make another dataframe. It should consist of two columns one is Month/Year and the other is Till Highest. I basically want to calculate the highest quantity value until that month and group it using month/year. Example of what I want precisely is,
Month/Year Till Highest
0 Jan/2019 15
1 Feb/2019 15
2 May/2019 25
3 Jul/2019 25
4 Sep/2019 44
5 Dec/2019 44
6 Apr/2020 111
7 Jun/2020 111
In my case, the dataset is vast, and I've readings of almost every day of each month and each year in the specified timeline. Here I've made a dummy dataset to show an example of what I want.
Please help me with this. Thanks in advance :)
See the annotated code:
(df
# convert date to monthly period (2019-01)
.assign(Date=pd.to_datetime(df['Date']).dt.to_period('M'))
# period and max quantity per month
.groupby('Date')
.agg(**{'Month/Year': ('Date', 'first'),
'Till highest': ('Quantity', 'max')})
# format periods as Jan/2019 and get cumulated max quantity
.assign(**{
'Month/Year': lambda d: d['Month/Year'].dt.strftime('%b/%Y'),
'Till highest': lambda d: d['Till highest'].cummax()
})
# drop the groupby index
.reset_index(drop=True)
)
output:
Month/Year Till highest
0 Jan/2019 15
1 Feb/2019 15
2 May/2019 25
3 Jul/2019 25
4 Sep/2019 44
5 Dec/2019 44
6 Apr/2020 111
7 Jun/2020 111
In R you can use cummax:
df=data.frame(Date=c("2019-01-05","2019-01-10","2019-01-22","2019-02-03","2019-05-11","2019-05-21","2019-07-08","2019-07-30","2019-09-05","2019-09-10","2019-09-25","2019-12-09","2020-04-11","2020-04-17","2020-06-05","2020-06-16","2020-06-22"),Quantity=c(10,15,14,12,25,4,1,15,31,44,8,10,111,5,17,12,14))
data.frame(`Month/Year`=unique(format(as.Date(df$Date),"%b/%Y")),
`Till Highest`=cummax(tapply(df$Quantity,sub("-..$","",df$Date),max)),
check.names=F,row.names=NULL)
Month/Year Till Highest
1 Jan/2019 15
2 Feb/2019 15
3 May/2019 25
4 Jul/2019 25
5 Sep/2019 44
6 Dec/2019 44
7 Apr/2020 111
8 Jun/2020 111

Python - How to export monthly data into excel based on month?

I have dataframe that contain two columns. Date from 2018 until now and Orders with order count for each day.
Date Orders
0 2018-01-01 57
1 2018-01-02 324
2 2018-01-03 54
3 2018-01-04 677
4 2018-01-05 234
5 2018-01-06 54
6 2018-01-07 234
7 2018-01-08 65
8 2018-01-09 234
9 2018-01-10 54
10 2018-01-11 234
11 2018-01-12 65
12 2018-01-13 7
13 2018-01-14 6
14 2018-01-15 57
15 2018-01-16 324
16 2018-01-17 54
17 2018-01-18 677
18 2018-01-19 234
19 2018-01-20 54
...
I need to export this into multiple excel files so that every files contain only data for one particular month.
I am trying to work on this script but i am struck:
import pandas as pd
df = pd.read_excel("data/SampleData.xlsx")
for dates in Date:
currMonth = something???
filename = 'file_'+list(set(pd.to_datetime(df.loc[currMonth,
'datestart']).dt.strftime('%m%d%y')))[0]+'.xlsx'
df.loc[idx, 'data'].to_excel(filename)
So I think i have to create variable that will store start and end of each month and than iterate through it.
Any idea how to get this to work?
You might want to have a look here. You can use simple integers to address the month, so you should be able to iterate like this (not tested):
for month in range(1, 13):
df_per_month = df[df['Date'].dt.month == month]
df_per_month.to_excel(f'{month}.xlsx')
Edit: Note that according to docs, month ranges from 1-12.
Also, if you want to iterate month and year, you would have to do something like:
for year in range(2018, 2022):
for month in range(1, 13):
data = df[(df['Date'].dt.month == month) & (df['Date'].dt.year == year)]
data.to_excel(f'{month}-{year}.xlsx')

Converting object into time and grouping/summarizing time (H/M/S) into 24 hours

I subsetted a big dataframe, slicing only one column Start Time with `type(object).
test = taxi_2020['Start Time']
Got a column
0 00:15:00
1 00:15:00
2 00:15:00
3 00:15:00
4 00:15:00
...
4137289 00:00:00
4137290 00:00:00
4137291 00:00:00
4137292 00:00:00
4137293 00:00:00
Name: Start Time, Length: 4137294, dtype: object
Then I grouped and summarized it by the count (to my best knowledge)
test.value_counts().sort_index().reset_index()
and got two columns
index Start Time
0 00:00:00 24005
1 00:15:00 22815
2 00:30:00 20438
3 00:45:00 19012
4 01:00:00 18082
... ... ...
91 22:45:00 32365
92 23:00:00 31815
93 23:15:00 29582
94 23:30:00 26903
95 23:45:00 24599
Not sure why this index column appeared, now I failed to rename it or convert.
What do I would like to see?
My ideal output - to group time by hour (24h format is ok), it looks like data counts every 15 min, so basically put each next 4 columns together. 00:15:00 is ok to be as 0 hour, 23:00:00 as 23rd hour.
My ideal output:
Hour Rides
0 34000
1 60000
2 30000
3 40000
I would like to create afterwards a simple histogram to show the occurrence by the hour.
Appreciate any help!
IIUC,
#Create dummy input datafframe
test = pd.DataFrame({'time':pd.date_range('2020-06-01', '2020-06-01 23:59:00', freq='15T').strftime('%H:%M:%S'),
'rides':np.random.randint(15000,28000,96)})
Let's create a DateTimeIndex from string and resample, aggregate with sum and convert DateTimeIndex to hours:
test2 = (test.set_index(pd.to_datetime(test['time'], format='%H:%M:%S'))
.rename_axis('hour').resample('H').sum())
test2.index = test2.index.hour
test2.reset_index()
Output:
hour rides
0 0 74241
1 1 87329
2 2 76933
3 3 86208
4 4 88002
5 5 82618
6 6 82188
7 7 81203
8 8 78591
9 9 95592
10 10 99778
11 11 85294
12 12 93931
13 13 80490
14 14 84181
15 15 71786
16 16 90962
17 17 96568
18 18 85646
19 19 88324
20 20 83595
21 21 89284
22 22 72061
23 23 74057
Step by step I found answer myself
Using this code, I renamed columns
test.rename(columns = {'index': "Time", 'Start Time': 'Rides'})
Got
The remaining question - how to summarize by the hour.
After applying
test2['hour'] = pd.to_datetime(test2['Time'], format='%H:%M:%S').dt.hour
test2
I came closer
Finally, I grouped by hour value
test3 = test2.groupby('hour', as_index=False).agg({"Rides": "sum"})
print(test3)

Sorting Columns By Ascending Order

Given this example dataframe,
Date 01012019 01022019 02012019 02022019 03012019 03022019
Period
1 45 21 43 23 32 23
2 42 12 43 11 14 65
3 11 43 24 23 21 12
I will like to sort the date based on the month - (the date is in ddmmyyyy). However, the date is a string when I type(date). I tried to use pd.to_datetime but it failed with an error month must be in 1..12.
Any advice? Thank you!
Specify format of datetimes in to_datetime and then sort_index:
df.columns = pd.to_datetime(df.columns, format='%d%m%Y')
df = df.sort_index(axis=1)
print (df)
2019-01-01 2019-01-02 2019-01-03 2019-02-01 2019-02-02 2019-02-03
Date
1 45 43 32 21 23 23
2 42 43 14 12 11 65
3 11 24 21 43 23 12

Vectorized Operations on a datetime column in pandas

I want to take a column of datetime objects and return a column of integers that are "days from that datetime until today". I can do it in an ugly way, looking for a prettier (and faster) way.
So suppose I have a dataframe with a datetime column like so:
11 2014-03-04 17:16:26+00:00
12 2014-03-10 01:35:56+00:00
13 2014-03-15 02:35:51+00:00
14 2014-03-20 05:55:47+00:00
15 2014-03-26 04:56:33+00:00
Name: datetime, dtype: object
And each element looks like:
datetime.datetime(2014, 3, 4, 17, 16, 26, tzinfo=<UTC>)
Suppose I want to calculate how many days ago each observation occurred, and return that as a simple integer. I know I can just use apply twice, but is there a vectorized/cleaner way to do it?
today = datetime.datetime.today().date()
df_dates = df['datetime'].apply(lambda x: x.date())
days_ago = today - df_dates
Which gives a timedelta64[ns] Series.
11 56 days, 00:00:00
12 50 days, 00:00:00
13 45 days, 00:00:00
14 40 days, 00:00:00
15 34 days, 00:00:00
Name: datetime, dtype: timedelta64[ns]
And then finally if I want it as an integer:
days_ago_as_int = days_ago.apply(lambda x: x.item().days)
days_ago_as_int
11 56
12 50
13 45
14 40
15 34
Name: datetime, dtype: int64
Any thoughts?
Related questions that didn't quite get at what I was asking:
Pandas Python- can datetime be used with vectorized inputs
Pandas add one day to column
Trying Karl D's answer, I'm successfully able to get today's date and the date column as desired, but something goes awry in the subtraction (different datetimes than in the original example, but shouldn't matter, right?):
converted_dates = df['date'].values.astype('datetime64[D]')
today_date = np.datetime64(dt.date.today())
print converted_dates
print today_date
print today_date - converted_dates
[2014-01-16 00:00:00
2014-01-19 00:00:00
2014-01-22 00:00:00
2014-01-26 00:00:00
2014-01-29 00:00:00]
2014-04-30 00:00:00
[16189 days, 0:08:20.637994
16189 days, 0:08:20.637991
16189 days, 0:08:20.637988
16189 days, 0:08:20.637984
16189 days, 0:08:20.637981]
How about (for a column named date)?
import datetime as dt
df['foo'] = (np.datetime64(dt.date.today())
- df['date'].values.astype('datetime64[D]'))
print df
date foo
0 2014-03-04 17:16:26 56 days
1 2014-03-10 01:35:56 50 days
2 2014-03-15 02:35:51 45 days
3 2014-03-20 05:55:47 40 days
4 2014-03-26 04:56:33 34 days
Or if you wanted it as an int:
df['foo'] = (np.datetime64(dt.date.today())
- df['date'].values.astype('datetime64[D]')).astype(int)
print df
date foo
0 2014-03-04 17:16:26 56
1 2014-03-10 01:35:56 50
2 2014-03-15 02:35:51 45
3 2014-03-20 05:55:47 40
4 2014-03-26 04:56:33 34
Or if it was an index
print np.datetime64(dt.date.today()) - df.index.values.astype('datetime64[D]')
[56 50 45 40 34]
Much later Edit: How about this for a work around?
>>> print df
date
0 2014-03-04 17:16:26
1 2014-03-10 01:35:56
2 2014-03-15 02:35:51
3 2014-03-20 05:55:47
4 2014-03-26 04:56:33
Try assigning today's date to a column so it gets converted to a datetime64 column by pandas and then do the arithmetic:
>>> df['today'] = dt.date.today()
>>> df['foo'] = (df['today'].values.astype('datetime64[D]')
- df['date'].values.astype('datetime64[D]'))
>>> print df
date today foo
0 2014-03-04 17:16:26 2014-05-14 71 days
1 2014-03-10 01:35:56 2014-05-14 65 days
2 2014-03-15 02:35:51 2014-05-14 60 days
3 2014-03-20 05:55:47 2014-05-14 55 days
4 2014-03-26 04:56:33 2014-05-14 49 days

Categories