Add days to date conditionally using Pandas - python

I have a table to which I need to add days, storing the result in a new column. The problem I am having is that there are two different date calculations, and which one applies depends on another column. Here is a similar table to the one I am working with:
Type Name Date
A Abe 6/2/2021
B Joe 6/15/2021
A Jin 6/25/2021
A Jen 6/1/2021
B Pan 6/21/2021
B Pin 6/22/2021
B Hon 6/11/2021
A Hen 6/23/2021
A Bin 6/23/2021
A Ban 6/5/2021
I am trying to get the table to return like this where Type A goes up by 7 days and Type B goes up by 2 business days:
Type Name Date NewDate
A Abe 6/2/2021 6/9/2021
B Joe 6/15/2021 6/19/2021
A Jin 6/25/2021 7/2/2021
A Jen 6/1/2021 6/8/2021
B Pan 6/21/2021 6/23/2021
B Pin 6/22/2021 6/26/2021
B Hon 6/11/2021 6/13/2021
A Hen 6/23/2021 6/30/2021
A Bin 6/23/2021 6/30/2021
A Ban 6/5/2021 6/12/2021
So far I have tried these:
import pandas as pd
from pandas.tseries.offsets import BDay
from datetime import datetime, timedelta
df1['NewDate'] = df1.apply(df1['Date'] + timedelta(days=7)
if x=='Emergency' else df1['Date'] + BDay(2) for x in df1['Type'])
Don't run that, either you will go in an infinite loop or it will take a very long time.
I've also run this:
df1['NewDate'] = [df1['Date'] + timedelta(days=7) if i=='Emergency' else df1['Date'] + BDay(2)
for i in df1.Type]  # also tried with df1['Type'], same results
This puts all the rows in a single row (almost looks like how it returns on jupyter notebook with the ...)
I have also tried this:
df1['NewDate'] = df1['Type'].apply(lambda x: df1['Date'] + timedelta(days=7) if x=='Emergency'
else df1['Date'] + BDay(2))
When I run that one, it goes through each row of Type and applies the correct logic (7 days if 'Emergency', otherwise business days). The problem is that every row returned is calculated from the first row of the entire table.
At this point I am a little lost; any help would be greatly appreciated. For simplicity's sake, it can be calculated as plus timedelta(7) and plus timedelta(2). Also, what would change if I had to add more conditions, say on the Name column?

To use apply, try:
df["Date"] = pd.to_datetime(df["Date"])
df["NewDate"] = df.apply(lambda x: x["Date"]+BDay(2) if x["Type"]=="B" else x["Date"]+pd.DateOffset(days=7), axis=1)
>>> df
Type Name Date NewDate
0 A Abe 2021-06-02 2021-06-09
1 B Joe 2021-06-15 2021-06-17
2 A Jin 2021-06-25 2021-07-02
3 A Jen 2021-06-01 2021-06-08
4 B Pan 2021-06-21 2021-06-23
5 B Pin 2021-06-22 2021-06-24
6 B Hon 2021-06-11 2021-06-15
7 A Hen 2021-06-23 2021-06-30
8 A Bin 2021-06-23 2021-06-30
9 A Ban 2021-06-05 2021-06-12
Alternatively, you can use numpy.where:
import numpy as np
# days=7 is spelled out so it matches the apply version above
df["NewDate"] = np.where(df["Type"]=="B", df["Date"]+BDay(2), df["Date"]+pd.DateOffset(days=7))
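On the follow-up about adding more conditions (for example on the Name column): one possible sketch is to compute a default and then layer boolean masks on top with Series.mask. The rule for Name == "Abe" below is purely hypothetical, made up for illustration, and the sample frame is a cut-down copy of the question's table. Later masks override earlier ones, so order them from least to most specific.

```python
import pandas as pd
from pandas.tseries.offsets import BDay

# Cut-down sample of the question's table
df = pd.DataFrame({
    "Type": ["A", "B", "A", "B"],
    "Name": ["Abe", "Joe", "Jin", "Hon"],
    "Date": pd.to_datetime(["6/2/2021", "6/15/2021", "6/25/2021", "6/11/2021"]),
})

is_b = df["Type"] == "B"
is_abe = df["Name"] == "Abe"  # hypothetical extra condition on Name

# Default rule: Type A gets 7 calendar days
new_date = df["Date"] + pd.DateOffset(days=7)
# Type B gets 2 business days instead
new_date = new_date.mask(is_b, df["Date"] + BDay(2))
# Hypothetical rule: Abe gets 1 calendar day, overriding the rules above
new_date = new_date.mask(is_abe, df["Date"] + pd.Timedelta(days=1))

df["NewDate"] = new_date
print(df)
```

Each condition stays on its own line, so adding another rule means adding one mask and one .mask() call rather than nesting more if/else inside a lambda.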

Related

How to calculate the quantity of business days between two dates using Pandas

I created a pandas df with columns named start_date and current_date. Both columns have a dtype of datetime64[ns]. What's the best way to find the quantity of business days between the current_date and start_date column?
I've tried:
from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import CustomBusinessDay
us_bd = CustomBusinessDay(calendar=USFederalHolidayCalendar())
projects_df['start_date'] = pd.to_datetime(projects_df['start_date'])
projects_df['current_date'] = pd.to_datetime(projects_df['current_date'])
projects_df['days_count'] = len(pd.date_range(start=projects_df['start_date'], end=projects_df['current_date'], freq=us_bd))
I get the following error message:
Cannot convert input....start_date, dtype: datetime64[ns]] of type <class 'pandas.core.series.Series'> to Timestamp
I'm using Python version 3.10.4.
pd.date_range's parameters need to be datetimes, not series.
For this reason, we can use df.apply to apply the function to each row.
In addition, pandas has bdate_range which is just date_range with freq defaulting to business days, which is exactly what you need.
Using apply and a lambda function, we can create a new Series calculating business days between each start and current date for each row.
projects_df['start_date'] = pd.to_datetime(projects_df['start_date'])
projects_df['current_date'] = pd.to_datetime(projects_df['current_date'])
projects_df['days_count'] = projects_df.apply(lambda row: len(pd.bdate_range(row['start_date'], row['current_date'])), axis=1)
Using a random sample of 10 date pairs, my output is the following:
start_date current_date bdays
0 2022-01-03 17:08:04 2022-05-20 00:53:46 100
1 2022-04-18 09:43:02 2022-06-10 16:56:16 40
2 2022-09-01 12:02:34 2022-09-25 14:59:29 17
3 2022-04-02 14:24:12 2022-04-24 21:05:55 15
4 2022-01-31 02:15:46 2022-07-02 16:16:02 110
5 2022-08-02 22:05:15 2022-08-17 17:25:10 12
6 2022-03-06 05:30:20 2022-07-04 08:43:00 86
7 2022-01-15 17:01:33 2022-08-09 21:48:41 147
8 2022-06-04 14:47:53 2022-12-12 18:05:58 136
9 2022-02-16 11:52:03 2022-10-18 01:30:58 175
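If the row-wise apply is too slow on a large frame, a vectorized alternative is numpy.busday_count. This is only a sketch on made-up sample dates, with two differences worth checking against your requirements: busday_count excludes the end date (bdate_range includes both endpoints), and it accepts a holidays argument, which is one way to bring back the USFederalHolidayCalendar from the question. The column name days_count_no_holidays is my own.

```python
import numpy as np
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

# Made-up sample data; the second range spans MLK Day (2022-01-17)
projects_df = pd.DataFrame({
    "start_date": pd.to_datetime(["2022-01-03", "2022-01-14"]),
    "current_date": pd.to_datetime(["2022-01-10", "2022-01-19"]),
})

# busday_count works on day-resolution datetime64 arrays
start = projects_df["start_date"].to_numpy().astype("datetime64[D]")
end = projects_df["current_date"].to_numpy().astype("datetime64[D]")

# Weekdays in the half-open interval [start, end)
projects_df["days_count"] = np.busday_count(start, end)

# Same count, but also skipping US federal holidays in the covered range
holidays = USFederalHolidayCalendar().holidays(
    start=projects_df["start_date"].min(),
    end=projects_df["current_date"].max(),
).to_numpy().astype("datetime64[D]")
projects_df["days_count_no_holidays"] = np.busday_count(start, end, holidays=holidays)

print(projects_df)
```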

Comparison between datetime64[ns] and date

I have a DataFrame with values that look like this:
Date Value
1 2020-04-12 A
2 2020-05-12 B
3 2020-07-12 C
4 2020-10-12 D
5 2020-11-12 E
and I need to create a new DataFrame containing only the dates from today (7.12) onward (in this example, only rows 3, 4 and 5).
I use this code:
df1= df[df["Date"] >= date.today()]
but it gives me TypeError: Invalid comparison between dtype=datetime64[ns] and date
What am I doing wrong? Thank you!
Use the .dt.date accessor on the df['Date'] column. Then you are comparing dates with dates. So:
df1 = df.loc[df['Date'].dt.date >= date.today()]
This will give you:
Date Value
3 2020-12-07 C
4 2020-12-10 D
5 2020-12-11 E
Also make sure that your date format is actually correct, for example by printing df['Date'].dt.month and checking that every value is 12. If not, your date strings were not parsed correctly. In that case, use df['Date'] = pd.to_datetime(df['Date'], format="%Y-%d-%m") to convert the Date column to the correct datetime format after creating the DataFrame.
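Another option, sketched here on the question's sample data, is to keep the column as datetime64 and compare it against a pandas Timestamp instead of a datetime.date, which avoids the type mismatch altogether. "Today" is pinned to 2020-12-07 so the result is reproducible; in real code you would presumably use pd.Timestamp.today().normalize().

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Date": ["2020-04-12", "2020-05-12", "2020-07-12", "2020-10-12", "2020-11-12"],
        "Value": ["A", "B", "C", "D", "E"],
    },
    index=[1, 2, 3, 4, 5],
)

# The strings are year-day-month, so spell the format out explicitly
df["Date"] = pd.to_datetime(df["Date"], format="%Y-%d-%m")

# Pinned for reproducibility; normally pd.Timestamp.today().normalize()
today = pd.Timestamp("2020-12-07")

# datetime64 column vs. Timestamp: no TypeError, no conversion to Python dates
df1 = df[df["Date"] >= today]
print(df1)
```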
Could you please try the following? It assumes that your dates are in YYYY-DD-MM format; if they are in another format, change the format string passed to strftime accordingly.
import pandas as pd
# pd.datetime was removed in pandas 2.0; pd.Timestamp.today() is the supported equivalent
today = pd.Timestamp.today().strftime("%Y-%d-%m")
df.loc[df['Date'] >= today]
Sample run of solution above: Let's say we have following test DataFrame.
Date Value
1 2020-04-12 A
2 2020-05-12 B
3 2020-07-12 C
4 2020-11-12 D
5 2020-12-12 E
Now when we run the solution above we will get following output:
Date Value
3 2020-07-12 C
4 2020-11-12 D
5 2020-12-12 E

How to define a 4-4-5 week period in Pandas

My company uses a 4-4-5 calendar for reporting purposes. Each month (aka period) is 4-weeks long, except every 3rd month is 5-weeks long.
Pandas seems to have good support for custom calendar periods. However, I'm having trouble figuring out the correct frequency string or custom business month offset to achieve months for a 4-4-5 calendar.
For example:
df_index = pd.date_range("2020-03-29", "2021-03-27", freq="D", name="date")
df = pd.DataFrame(
index=df_index, columns=["a"], data=np.random.randint(0, 100, size=len(df_index))
)
df.groupby(pd.Grouper(level=0, freq="4W-SUN")).mean()
Grouping by 4-weeks starting on Sunday results in the following. The first three month start dates are correct but I need every third month to be 5-weeks long. The 4th month start date should be 2020-06-28.
a
date
2020-03-29 16.000000
2020-04-26 50.250000
2020-05-24 39.071429
2020-06-21 52.464286
2020-07-19 41.535714
2020-08-16 46.178571
2020-09-13 51.857143
2020-10-11 44.250000
2020-11-08 47.714286
2020-12-06 56.892857
2021-01-03 55.821429
2021-01-31 53.464286
2021-02-28 53.607143
2021-03-28 45.037037
Essentially what I'd like to achieve is something like this:
a
date
2020-03-29 20.000000
2020-04-26 50.750000
2020-05-24 49.750000
2020-06-28 49.964286
2020-07-26 52.214286
2020-08-23 47.714286
2020-09-27 46.250000
2020-10-25 53.357143
2020-11-22 52.035714
2020-12-27 39.750000
2021-01-24 43.428571
2021-02-21 49.392857
Pandas currently supports only yearly and quarterly offsets for the 52-53 week fiscal year (aka 4-4-5 calendar).
See pandas.tseries.offsets.FY5253 and pandas.tseries.offsets.FY5253Quarter.
df_index = pd.date_range("2020-03-29", "2021-03-27", freq="D", name="date")
df = pd.DataFrame(index=df_index)
df['a'] = np.random.randint(0, 100, df.shape[0])
So indeed you need some more work to get to week level and maintain a 4-4-5 calendar. You could align to quarters using the native pandas offset and fill-in the 4-4-5 week pattern manually.
def date_range(start, end, offset_array, name=None):
    start = pd.to_datetime(start)
    end = pd.to_datetime(end)
    index = []
    start -= offset_array[0]
    while(start<end):
        for x in offset_array:
            start += x
            if start > end:
                break
            index.append(start)
    return pd.Series(index, name=name)
This function takes a list of offsets rather than a single regular frequency, so it can step from date to date by cycling through the offsets in the given array:
offset_445 = [
pd.tseries.offsets.FY5253Quarter(weekday=6),
4*pd.tseries.offsets.Week(weekday=6),
4*pd.tseries.offsets.Week(weekday=6),
]
df_index_445 = date_range("2020-03-29", "2021-03-27", offset_445, name='date')
Out:
0 2020-05-03
1 2020-05-31
2 2020-06-28
3 2020-08-02
4 2020-08-30
5 2020-09-27
6 2020-11-01
7 2020-11-29
8 2020-12-27
9 2021-01-31
10 2021-02-28
Name: date, dtype: datetime64[ns]
Once the index is created, then it's back to aggregations logic to get the data in the right row buckets. Assuming that you want the mean for the start of each 4 or 5 week period, according to the df_index_445 you have generated, it could look like this:
# calculate the mean on reindex groups
reindex = df_index_445.searchsorted(df.index, side='right') - 1
res = df.groupby(reindex).mean()
# filter valid output
res = res[res.index>=0]
res.index = df_index_445
Out:
a
2020-05-03 47.857143
2020-05-31 53.071429
2020-06-28 49.257143
2020-08-02 40.142857
2020-08-30 47.250000
2020-09-27 52.485714
2020-11-01 48.285714
2020-11-29 56.178571
2020-12-27 51.428571
2021-01-31 50.464286
2021-02-28 53.642857
Note that since the frequency is not regular, pandas will set the datetime index frequency to None.
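If the first day of the fiscal year is known, a simpler sketch is to skip the offsets entirely and step forward in a repeating 4, 4, 5 week cycle. The helper name period_starts_445 is hypothetical, and the code assumes the fiscal year begins exactly on the start date and never needs a 53rd week. With the question's range it yields the period starts the question asked for (2020-03-29, 2020-04-26, 2020-05-24, 2020-06-28, ...).

```python
from itertools import cycle

import numpy as np
import pandas as pd


def period_starts_445(start, end):
    """Hypothetical helper: first day of each 4-4-5 period in [start, end]."""
    current, end = pd.Timestamp(start), pd.Timestamp(end)
    starts = []
    for weeks in cycle([4, 4, 5]):
        if current > end:
            break
        starts.append(current)
        current += pd.Timedelta(weeks=weeks)
    return pd.DatetimeIndex(starts, name="date")


starts = period_starts_445("2020-03-29", "2021-03-27")

df_index = pd.date_range("2020-03-29", "2021-03-27", freq="D", name="date")
df = pd.DataFrame({"a": np.random.randint(0, 100, size=len(df_index))}, index=df_index)

# For each day, find the position of the last period start on or before it
bucket = starts.searchsorted(df.index, side="right") - 1

# Group the daily rows by the start date of the period they fall into
res = df.groupby(starts[bucket]).mean()
print(res)
```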

How do I create a new column with a set timeframe using Pandas datetime64

I’m trying to look at some sales data for a small store. I have a time stamp of when the settlement was made, but sometimes it’s done before midnight and sometimes it’s done after midnight.
This is giving me data correct for some days and incorrect for others, as anything after midnight should be for the day before. I couldn’t find the correct pandas documentation for what I’m looking for.
Is there an if else solution to create a new column, loop through the NEW_TIMESTAMP column and set a custom timeframe (if after midnight, but before 3pm: set the day before ; else set the day). Every time I write something it either runs forever, or it crashes jupyter.
Data:
What I did is I created another series which says when a day should be offset back by one day, and I multiplied it by a pd.timedelta object, such that 0 turns into "0 days" and 1 turns into "1 day". Subtracting two series gives the right result.
Let me know how the following code works for you.
import pandas as pd
import numpy as np
# copied from https://stackoverflow.com/questions/50559078/generating-random-dates-within-a-given-range-in-pandas
def random_dates(start, end, n=15):
    start_u = start.value//10**9
    end_u = end.value//10**9
    return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s')
dates = random_dates(start=pd.to_datetime('2020-01-01'),
                     end=pd.to_datetime('2021-01-01'))
timestamps = pd.Series(dates)
# this takes only the hour component of every datetime
hours = timestamps.dt.hour
# this takes only the date component of every datetime
dates = timestamps.dt.date
# this compares the hours with 15, and returns a boolean if it is smaller
flag_is_day_before = hours < 15
# now you can set the dates by multiplying the 1s and 0s with a day timedelta
new_dates = dates - pd.to_timedelta(1, unit='day') * flag_is_day_before
df = pd.DataFrame(data=dict(timestamps=timestamps, new_dates=new_dates))
print(df)
This outputs
timestamps new_dates
0 2020-07-10 20:11:13 2020-07-10
1 2020-05-04 01:20:07 2020-05-03
2 2020-03-30 09:17:36 2020-03-29
3 2020-06-01 16:16:58 2020-06-01
4 2020-09-22 04:53:33 2020-09-21
5 2020-08-02 20:07:26 2020-08-02
6 2020-03-22 14:06:53 2020-03-21
7 2020-03-14 14:21:12 2020-03-13
8 2020-07-16 20:50:22 2020-07-16
9 2020-09-26 13:26:55 2020-09-25
10 2020-11-08 17:27:22 2020-11-08
11 2020-11-01 13:32:46 2020-10-31
12 2020-03-12 12:26:21 2020-03-11
13 2020-12-28 08:04:29 2020-12-27
14 2020-04-06 02:46:59 2020-04-05
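A shorter variant of the same idea, as a sketch: shift every timestamp back by the 15-hour cutoff and then truncate to midnight. Anything before 15:00 falls into the previous day and anything from 15:00 onward stays on its own day. The three sample timestamps are taken from the output above, and the name business_day is my own. Note that the result is a datetime64 column rather than Python date objects.

```python
import pandas as pd

# Three timestamps taken from the sample output above
timestamps = pd.Series(pd.to_datetime([
    "2020-07-10 20:11:13",  # after 15:00  -> stays on 2020-07-10
    "2020-05-04 01:20:07",  # after midnight -> belongs to 2020-05-03
    "2020-03-22 14:06:53",  # before 15:00 -> belongs to 2020-03-21
]))

# Shift back by the cutoff, then drop the time-of-day part
business_day = (timestamps - pd.Timedelta(hours=15)).dt.normalize()

df = pd.DataFrame({"timestamps": timestamps, "new_dates": business_day})
print(df)
```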

changing relative times to actual dates in a pandas dataframe

I have currently a dataframe I created by scraping google news headlines. One of my columns is "Time", which refers to time of publication of an article.
Unfortunately, for recent articles, google news uses a "relative" date, e.g., 6 hours ago, or 1 day ago instead of Nov 1, 2017.
I really want to convert these relative dates to be consistent with the other entries (so they also say Nov 12, 2017, for example), but I have no idea where to even start on this.
My thoughts are to maybe make a variable which represents todays date, and then do some kind of search through the dataframe for stuff which doesn't match my format, and then to subtract those relative times with the current date. I would also have to make some sort of filter for stuff which has "hours ago" and just have those equal the current date.
I don't really want a solution but rather a general idea of what to read to try to solve this. Am I supposed to try using numpy?
Example of some rows:
Publication Time Headline
0 The San Diego Union-Tribune 6 hours ago I am not opposed to new therapeutic modalities...
1 Devon Live 13 hours ago If you're looking for a bargain this Christmas...
15 ABS-CBN News 1 day ago Now, Thirdy has a chance to do something that ...
26 New York Times Nov 2, 2017 Shepherds lead their sheep through the centre ...
You can use to_datetime with to_timedelta first and then use combine_first with floor:
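#create dates
dates = pd.to_datetime(df['Time'], errors='coerce')
#create times (raw string so that \s is passed to the regex engine unchanged)
times = pd.to_timedelta(df['Time'].str.extract(r'(.*)\s+ago', expand=False))
#combine final datetimes (pd.datetime was removed in pandas 2.0, so use pd.Timestamp.now())
df['Time'] = (pd.Timestamp.now() - times).combine_first(dates).dt.floor('D')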
print (df)
Publication Time \
0 The San Diego Union-Tribune 2017-11-12
1 Devon Live 2017-11-11
2 ABS-CBN News 2017-11-11
3 New York Times 2017-11-02
Headline
0 I am not opposed to new therapeutic modalities
1 If you're looking for a bargain this Christmas
2 Now, Thirdy has a chance to do something that
3 Shepherds lead their sheep through the centre
print (df['Time'])
0 2017-11-12
1 2017-11-11
2 2017-11-11
3 2017-11-02
Name: Time, dtype: datetime64[ns]
Your approach should work. Use Pandas Timedelta to subtract relative dates from the current date.
For example, given your sample data as:
Publication;Time;Headline
The San Diego Union-Tribune;6 hours ago;I am not opposed to new therapeutic modalities
Devon Live;13 hours ago;If you're looking for a bargain this Christmas
ABS-CBN News;1 day ago;Now, Thirdy has a chance to do something that
New York Times;Nov 2, 2017;Shepherds lead their sheep through the centre
Read in the data from the clipboard (although you could just as easily substitute with read_csv() or some other file format):
import pandas as pd
from datetime import datetime
df = pd.read_clipboard(sep=";")
For the dates that are already in date format, Pandas is smart enough to convert them with to_datetime():
absolute_date = pd.to_datetime(df.Time, errors="coerce")
absolute_date
0 NaT
1 NaT
2 NaT
3 2017-11-02
Name: Time, dtype: datetime64[ns]
For the relative dates, once we drop the "ago" part, they're basically in the right format to convert with pd.Timedelta:
relative_date = (datetime.today() -
df.Time.str.extract("(.*) ago", expand=False).apply(pd.Timedelta))
relative_date
0 2017-11-11 17:05:54.143548
1 2017-11-11 10:05:54.143548
2 2017-11-10 23:05:54.143548
3 NaT
Name: Time, dtype: datetime64[ns]
Now fill in the respective NaN values from each set, absolute and relative (updated to use combine_first(), via Jezrael's answer):
date = relative_date.combine_first(absolute_date)
date
0 2017-11-11 17:06:29.658925
1 2017-11-11 10:06:29.658925
2 2017-11-10 23:06:29.658925
3 2017-11-02 00:00:00.000000
Name: Time, dtype: datetime64[ns]
Finally, pull out just the date from the datetime:
date.dt.date
0 2017-11-11
1 2017-11-11
2 2017-11-10
3 2017-11-02
Name: Time, dtype: object
