Manipulating Date in Pandas - python

I'm trying to understand various functions in Python as I come from an R background.
The question I face is: How do I add and subtract days/years/months from pandas based on a condition? In R, I can use the dplyr package where mutate and ifelse will allow me to achieve it together with the lubridate package.
Here is my reproducible data in R:
df = data.frame(date1=c("2017-07-07", "2017-02-11", "2017-05-22", "2017-04-27"))
library(lubridate)
df$date1 <- ymd(df$date1) + years(2)
df$day <- wday(df$date1, label=TRUE)
Input
date1 day
1 2019-07-07 Sun
2 2019-02-11 Mon
3 2019-05-22 Wed
4 2019-04-27 Sat
Task: Add a year to the date if the day is "Sun" and subtract a year from the date if day is "Sat", else IGNORE
R Code
library(dplyr)
df %>% mutate(newdate = ifelse(df$day == "Sun", date1 %m+% years(1),
ifelse(df$day == "Sat", date1 %m-% years(1), date1))) -> df
df$newdate <- as.Date(df$newdate, origin = "1970-01-01")
df$newday <- wday(df$newdate, label=T)
df
Output
date1 day newdate newday
1 2019-07-07 Sun 2020-07-07 Tue
2 2019-02-11 Mon 2019-02-11 Mon
3 2019-05-22 Wed 2019-05-22 Wed
4 2019-04-27 Sat 2018-04-27 Fri
Could someone share with me how to achieve this output using Pandas?

Use DateOffset to add years, and Series.dt.strftime with %a for the abbreviated day names:
import numpy as np
import pandas as pd
df = pd.DataFrame({'date1':pd.to_datetime(["2017-07-07",
"2017-02-11",
"2017-05-22",
"2017-04-27"])})
df['date1'] += pd.offsets.DateOffset(years=2)
df['day'] = df['date1'].dt.strftime('%a')
To set values from multiple boolean masks, use numpy.select:
masks = [df['day'] == 'Sun',
df['day'] == 'Sat']
vals = [df['date1'] + pd.offsets.DateOffset(years=1),
df['date1'] - pd.offsets.DateOffset(years=1)]
df['newdate'] = np.select(masks, vals, default=df['date1'])
df['newday'] = df['newdate'].dt.strftime('%a')
print (df)
date1 day newdate newday
0 2019-07-07 Sun 2020-07-07 Tue
1 2019-02-11 Mon 2019-02-11 Mon
2 2019-05-22 Wed 2019-05-22 Wed
3 2019-04-27 Sat 2018-04-27 Fri
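A variant of the same idea, as a minimal sketch: it assumes the dates have already been shifted to 2019 as in the question, and it tests the weekday number (Series.dt.dayofweek, Monday=0 ... Sunday=6) instead of the formatted name, so the result does not depend on the locale used by strftime. The names is_sun and is_sat are illustrative only.

```python
import pandas as pd

# Dates as they look after the +2 years step in the question.
df = pd.DataFrame({"date1": pd.to_datetime(
    ["2019-07-07", "2019-02-11", "2019-05-22", "2019-04-27"])})

# Weekday as an integer: Monday=0 ... Saturday=5, Sunday=6.
dow = df["date1"].dt.dayofweek
is_sun = dow == 6
is_sat = dow == 5

# Series.mask replaces values where the condition is True and keeps the rest.
df["newdate"] = (
    df["date1"]
    .mask(is_sun, df["date1"] + pd.DateOffset(years=1))
    .mask(is_sat, df["date1"] - pd.DateOffset(years=1))
)

# English day names, shortened to three letters.
df["newday"] = df["newdate"].dt.day_name().str[:3]
print(df)
```

Because mask keeps the datetime64 dtype, the .dt accessor still works on newdate afterwards.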

This should work fine for you:
import pandas as pd
df = pd.DataFrame(data = {'date1':["2017-07-07", "2017-02-11", "2017-05-22", "2017-04-27"], 'day':["Sun", "Mon", "Wed", "Sat"]})
df['date1']= pd.to_datetime(df['date1'])
df['date1'] = df['date1'] + pd.DateOffset(years=2)
def func_year(row):
    # shift the date by one year depending on the day name
    if row['day'] == 'Sun':
        date = row['date1'] + pd.DateOffset(years=1)
    elif row['day'] == 'Sat':
        date = row['date1'] - pd.DateOffset(years=1)
    else:
        date = row['date1']
    return date
df['new_date'] = df.apply(func_year, axis=1)

Related

Python merge multiple date columns with null into 1 column Pandas

I am trying to convert several date formats into YYYY-MM-DD and then merge them into one column, ignoring the NULL values, but I end up with TypeError: cannot add DatetimeArray and DatetimeArray
import datetime
import pandas as pd
data = [[ 'Apr 2021'], ['Jan 1'], ['Fri'], [ 'Jan 18']]
df = pd.DataFrame(data, columns = ['date', ])
#convert Month date Jan 1
df['date1']=(pd.to_datetime('2021 '+ df['date'],errors='coerce',format='%Y %b %d'))
# convert Month Year Apr 2021
df['date2']=pd.to_datetime(df['date'], errors='coerce')
#convert fri to this friday
today = datetime.date.today()
friday = today + datetime.timedelta( (4-today.weekday()) % 7 )
this_firday = friday.strftime('%Y-%m-%d')
df['date3']=df['date'].map({'Fri':this_firday})
df['date3'] = pd.to_datetime(df['date3'])
df['dateFinal'] = df['date1'] + df['date2'] + df['date3']
I checked the dtypes and they are all datetime, so I don't know why the addition fails. My approach is not efficient, so feel free to suggest a better way.
IIUC:
try via bfill() on axis=1:
df['dateFinal'] = df[['date1','date2','date3']].bfill(axis=1).iloc[:,0]
OR
via ffill() on axis=1:
df['dateFinal'] = df[['date1','date2','date3']].ffill(axis=1).iloc[:,-1]
OR
via stack()+to_numpy() (this only works when every row has exactly one non-null value):
df['dateFinal'] = df[['date1','date2','date3']].stack().to_numpy()
output of df:
date date1 date2 date3 dateFinal
0 Apr 2021 NaT 2021-04-01 NaT 2021-04-01
1 Jan 1 2021-01-01 NaT NaT 2021-01-01
2 Fri NaT NaT 2021-08-13 2021-08-13
3 Jan 18 2021-01-18 NaT NaT 2021-01-18
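The TypeError arises because adding two datetimes is not a defined operation; what is wanted is the first non-null value per row. A minimal sketch of another way to do that with Series.combine_first, using fixed dates that mirror the output above (so it does not depend on today's date):

```python
import pandas as pd

# Three partially filled datetime columns; None becomes NaT.
df = pd.DataFrame({
    "date1": pd.to_datetime([None, "2021-01-01", None, "2021-01-18"]),
    "date2": pd.to_datetime(["2021-04-01", None, None, None]),
    "date3": pd.to_datetime([None, None, "2021-08-13", None]),
})

# combine_first fills the NaT slots of the left Series from the right one.
df["dateFinal"] = (
    df["date1"]
    .combine_first(df["date2"])
    .combine_first(df["date3"])
)
print(df)
```

The column order in the chain sets the priority when more than one column has a value in the same row.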

How to get a date from year, month, week of month and Day of week in Pandas?

I have a Pandas dataframe, which looks like below
I want to create a new column, which tells the exact date from the information from all the above columns. The code should look something like this:
df['Date'] = pd.to_datetime(df['Month']+df['WeekOfMonth']+df['DayOfWeek']+df['Year'])
I was able to find a workaround for your case. You will need to define the dictionaries for the months and the days of the week.
month = {"Jan":"01", "Feb":"02", "Mar":"03", "Apr": "04", "May":"05", "Jun":"06", "Jul":"07", "Aug":"08", "Sep":"09", "Oct":"10", "Nov":"11", "Dec":"12"}
week = {"Monday":1,"Tuesday":2,"Wednesday":3,"Thursday":4,"Friday":5,"Saturday":6,"Sunday":7}
With these dictionaries, the transformation that I used on a custom dataframe was:
rows = [["Dec",5,"Wednesday", "1995"],
["Jan",3,"Wednesday","2013"]]
df = pd.DataFrame(rows, columns=["Month","Week","Weekday","Year"])
df['Date'] = (df["Year"] + "-" + df["Month"].map(month) + "-" + (df["Week"].apply(lambda x: (x - 1)*7) + df["Weekday"].map(week).apply(int) ).apply(str)).astype('datetime64[ns]')
However, you have to be careful. Some of the data you posted as an example produces day numbers that exceed the length of the month. For example, for
row = ["Oct",5,"Friday","2018"]
The date displayed is 2018-10-33. I recommend adding some logic to filter your data in order to avoid this kind of problem.
Let's approach it in 3 steps as follows:
Get the date of month start Month_Start from Year and Month
Calculate the date offsets DateOffset relative to Month_Start from WeekOfMonth and DayOfWeek
Get the actual date Date from Month_Start and DateOffset
Here's the code:
df['Month_Start'] = pd.to_datetime(df['Year'].astype(str) + df['Month'] + '01', format="%Y%b%d")
import time
df['DateOffset'] = (df['WeekOfMonth'] - 1) * 7 + df['DayOfWeek'].map(lambda x: time.strptime(x, '%A').tm_wday) - df['Month_Start'].dt.dayofweek
df['Date'] = df['Month_Start'] + pd.to_timedelta(df['DateOffset'], unit='D')
Output:
Month WeekOfMonth DayOfWeek Year Month_Start DateOffset Date
0 Dec 5 Wednesday 1995 1995-12-01 26 1995-12-27
1 Jan 3 Wednesday 2013 2013-01-01 15 2013-01-16
2 Oct 5 Friday 2018 2018-10-01 32 2018-11-02
3 Jun 2 Saturday 1980 1980-06-01 6 1980-06-07
4 Jan 5 Monday 1976 1976-01-01 25 1976-01-26
The Date column now contains the dates derived from the information from other columns.
You can remove the working interim columns, if you like, as follows:
df = df.drop(['Month_Start', 'DateOffset'], axis=1)
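Note that row 2 above (Oct, week 5, Friday, 2018) lands on 2018-11-02, i.e. in the following month. If such rows should be treated as invalid instead, one possible sketch is to blank them out with Series.where. The weekday_number dictionary is an assumption introduced here to avoid the locale-dependent time.strptime lookup; the column names follow the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    "Month": ["Dec", "Oct"],
    "WeekOfMonth": [5, 5],
    "DayOfWeek": ["Wednesday", "Friday"],
    "Year": [1995, 2018],
})

# Hypothetical lookup table: Monday=0 ... Sunday=6, matching Series.dt.dayofweek.
weekday_number = {"Monday": 0, "Tuesday": 1, "Wednesday": 2, "Thursday": 3,
                  "Friday": 4, "Saturday": 5, "Sunday": 6}

month_start = pd.to_datetime(
    df["Year"].astype(str) + df["Month"] + "01", format="%Y%b%d")
offset = ((df["WeekOfMonth"] - 1) * 7
          + df["DayOfWeek"].map(weekday_number)
          - month_start.dt.dayofweek)
date = month_start + pd.to_timedelta(offset, unit="D")

# Keep the date only if it is still inside the requested month, else NaT.
df["Date"] = date.where(date.dt.month == month_start.dt.month)
print(df)
```

Whether an overflowing week should become NaT or roll into the next month depends on how WeekOfMonth is defined in the data, which the question does not specify.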

Days before end of month in pandas

I would like to get the number of days before the end of the month, from a string column representing a date.
I have the following pandas dataframe :
df = pd.DataFrame({'date':['2019-11-22','2019-11-08','2019-11-30']})
df
date
0 2019-11-22
1 2019-11-08
2 2019-11-30
I would like the following output :
df
date days_end_month
0 2019-11-22 8
1 2019-11-08 22
2 2019-11-30 0
The package pd.tseries.MonthEnd with rollforward seemed a good pick, but I can't figure out how to use it to transform a whole column.
Subtract the day of the month (Series.dt.day) from the number of days in the month (Series.dt.daysinmonth):
df['date'] = pd.to_datetime(df['date'])
df['days_end_month'] = df['date'].dt.daysinmonth - df['date'].dt.day
Or use offsets.MonthEnd(0) to roll each date forward to its month end, subtract the original date, and convert the timedeltas to days with Series.dt.days:
df['days_end_month'] = (df['date'] + pd.offsets.MonthEnd(0) - df['date']).dt.days
print (df)
date days_end_month
0 2019-11-22 8
1 2019-11-08 22
2 2019-11-30 0
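As a quick sanity check, here is a small sketch that runs both approaches on a few extra dates (chosen here for illustration, including a leap-year February) and confirms they agree:

```python
import pandas as pd

s = pd.to_datetime(pd.Series(["2020-02-10", "2019-02-10", "2019-11-30"]))

# Approach 1: days in the month minus the day of the month.
by_attr = s.dt.daysinmonth - s.dt.day

# Approach 2: roll forward to the month end (a date already on the
# month end stays put with MonthEnd(0)), then take the difference in days.
by_offset = (s + pd.offsets.MonthEnd(0) - s).dt.days

print(by_attr.tolist())
print(by_offset.tolist())
```

February 2020 has 29 days, so the first value is one larger than the second.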

Pandas date_range freq='BAS-JUL' does not pick up the first day of the month

I am labeling transactional data with fiscal year ranges. For example, the 2018-2019 fiscal year runs from 7/1/2018 to 6/30/2019. For some reason, when I run the following code, any transaction that happened on 7/1/2018 (the first day of the fiscal year) gets labeled 2017 - 2018. Sample data is provided as well.
import numpy as np
import pandas as pd
data = [['Start 17-18 Fiscal', '7/1/2017'], ['End 17-18 Fiscal', '6/30/2018'], ['Start 18-19 Fiscal', '7/1/2018'],
['End 18-19 Fiscal', '6/30/2019'], ['Start 19-20 Fiscal', '7/1/2019'], ['End 19-20 Fiscal', '6/30/2020']]
df = pd.DataFrame(data, columns=['Correct Fiscal', 'Date'])
df['Date'] = pd.to_datetime(df['Date'])
y_max = df['Date'].dt.year.max() + 1
y_min = df['Date'].dt.year.min() - 1
labels = [str(x) + ' - ' + str(x+1) for x in np.arange(y_min, y_max, 1)]
df['pay_period'] = pd.cut(df.Date, pd.date_range(str(y_min), str(y_max+1), freq='BAS-JUL'), right=False, labels=labels)
Also, if you look at sample data for 2019 - 2020 fiscal both are labeled as expected. Below is the output.
Correct Fiscal Date pay_period
0 Start 17-18 Fiscal 2017-07-01 2016 - 2017
1 End 17-18 Fiscal 2018-06-30 2017 - 2018
2 Start 18-19 Fiscal 2018-07-01 2017 - 2018
3 End 18-19 Fiscal 2019-06-30 2018 - 2019
4 Start 19-20 Fiscal 2019-07-01 2019 - 2020
5 End 19-20 Fiscal 2020-06-30 2019 - 2020
Updated Solution
So, I was able to fix this and reduce the code to just these two lines:
period_end = pd.to_datetime(df.Date).apply(pd.Period, freq='A-JUN')
df['fiscal_p'] = (period_end - 1).astype(str) + ' - ' + period_end.astype(str)
Thanks to Dan for providing the function answer as well. I can confirm that his answer works as well.
I think the problem is with your "labels" line, not the date range frequency. The labels list causes the first row to be labelled '2016 - 2017', which is incorrect according to your inputs.
Here's an alternative way to get your desired output, using a simple function:
data = [['Start 17-18 Fiscal', '7/1/2017'], ['End 17-18 Fiscal', '6/30/2018'], ['Start 18-19 Fiscal', '7/1/2018'],
['End 18-19 Fiscal', '6/30/2019'], ['Start 19-20 Fiscal', '7/1/2019'], ['End 19-20 Fiscal', '6/30/2020']]
df = pd.DataFrame(data, columns=['Correct Fiscal', 'Date'])
df['Date'] = pd.to_datetime(df['Date'])
def find_pay_period(date):
    # only the first (July) and last (June) month of the fiscal year are handled
    if date.month == 7:
        end_year = date.year + 1
    elif date.month == 6:
        end_year = date.year
    else:
        return 'undefined'
    return f'{end_year - 1} - {end_year}'
df['pay_period'] = df['Date'].apply(find_pay_period)
Which gives the following output:
Correct Fiscal Date pay_period
0 Start 17-18 Fiscal 2017-07-01 2017 - 2018
1 End 17-18 Fiscal 2018-06-30 2017 - 2018
2 Start 18-19 Fiscal 2018-07-01 2018 - 2019
3 End 18-19 Fiscal 2019-06-30 2018 - 2019
4 Start 19-20 Fiscal 2019-07-01 2019 - 2020
5 End 19-20 Fiscal 2020-06-30 2019 - 2020
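The function above only covers dates in July and June. For real transaction data with dates in any month, a vectorized sketch along the same lines could look like this. It assumes the fiscal year starts on July 1, as in the question; the sample dates are made up for illustration.

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(
    ["2017-07-01", "2018-06-30", "2018-07-01", "2018-12-15", "2019-03-01"]))

# A date in July or later belongs to the fiscal year ending next calendar year.
end_year = dates.dt.year + (dates.dt.month >= 7).astype(int)
labels = (end_year - 1).astype(str) + " - " + end_year.astype(str)
print(labels.tolist())
```

This avoids both pd.cut and a row-wise apply, since the label follows directly from the year and month.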

Python Pandas: split and change the date format when some values are ranges (e.g. Aug 2018 - Nov 2018) and others are single dates?

Split Date e.g. Aug 2018 --> 01-08-2018 ??
Here's my sample input
id year_pass
1 Aug 2018 - Nov 2018
2 Jul 2017
Here's my sample input 2
id year_pass
1 Jul 2018
2 Aug 2017 - Nov 2018
What I did:
I'm able to split the date when the value is a range, e.g. (Aug 2018 - Nov 2018)
# splitting the date column on the '-'
year_start, year_end = df['year_pass'].str.split('-')
df.drop('year_pass', axis=1, inplace=True)
# assigning the split values to columns
df['year_start'] = year_start
df['year_end'] = year_end
# converting to datetime objects
df['year_start'] = pd.to_datetime(df['year_start'])
df['year_end'] = pd.to_datetime(df['year_end'])
But I couldn't figure out how to handle both ranges and single dates.
Output should be:
id year_start year_end
1 01-08-2018 01-11-2018
2 01-07-2017
This is one approach using dt.strftime("%d-%m-%Y").
Ex:
import pandas as pd
df = pd.DataFrame({"year_pass": ["Aug 2018 - Nov 2018", "Jul 2017"]})
df[["year_start", 'year_end']] = df["year_pass"].str.split(" - ", expand=True)
df["year_start"] = pd.to_datetime(df['year_start']).dt.strftime("%d-%m-%Y")
df["year_end"] = pd.to_datetime(df['year_end']).dt.strftime("%d-%m-%Y")
df.drop('year_pass', axis=1, inplace=True)
print(df)
Output:
year_start year_end
0 01-08-2018 01-11-2018
1 01-07-2017 NaT
Edit as per comment:
import pandas as pd
def replaceInitialSpace(val):
    # a leading space means the start date is missing, so prepend the separator
    if val.startswith(" "):
        return " - "+val.strip()
    return val
df = pd.DataFrame({"year_pass": [" Jul 2018", "Aug 2018 - Nov 2018", "Jul 2017 "]})
df["year_pass"] = df["year_pass"].apply(replaceInitialSpace)
df[["year_start", 'year_end']] = df["year_pass"].str.split(" - ", expand=True)
df["year_start"] = pd.to_datetime(df['year_start']).dt.strftime("%d-%m-%Y")
df["year_end"] = pd.to_datetime(df['year_end']).dt.strftime("%d-%m-%Y")
df.drop('year_pass', axis=1, inplace=True)
print(df)
Output:
year_start year_end
0 NaT 01-07-2018
1 01-08-2018 01-11-2018
2 01-07-2017 NaT
You could start by splitting the strings in the original dataframe:
# split the original dataframe
df = df.year_pass.str.split(' - ', expand=True)
0 1
id
1 Aug2018 Nov2018
2 Jul2017 None
And then apply pd.to_datetime to turn the strings to datetime objects and format them using strftime:
# rename the columns
df.columns = ['year_start','year_end']
df.apply(lambda x: pd.to_datetime(x, errors='coerce').dt.strftime('%d-%m-%Y'), axis=0)
year_start year_end
id
1 01-08-2018 01-11-2018
2 01-07-2017 NaT
If you need datetimes in the output, a different format is necessary (YYYY-MM-DD):
df1 = df.pop('year_pass').str.split(r'\s+-\s+', expand=True).apply(pd.to_datetime)
df[['year_start','year_end']] = df1
print (df)
id year_start year_end
0 1 2018-08-01 2018-11-01
1 2 2017-07-01 NaT
print (df.dtypes)
id int64
year_start datetime64[ns]
year_end datetime64[ns]
dtype: object
If you need to change the format, you get strings, so all datetime-like functions will fail on the result:
df1 = (df.pop('year_pass').str.split(r'\s+-\s+', expand=True)
.apply(lambda x: pd.to_datetime(x).dt.strftime('%d-%m-%Y'))
.replace('NaT',''))
df[['year_start','year_end']] = df1
print (df)
id year_start year_end
0 1 01-08-2018 01-11-2018
1 2 01-07-2017
print (df.dtypes)
id int64
year_start object
year_end object
dtype: object
print (type(df.loc[0, 'year_start']))
<class 'str'>
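Putting the pieces together, here is a minimal self-contained sketch that keeps real datetimes and also copes with a frame where no row contains a range. The reindex step and the explicit "%b %Y" format are assumptions added here, not part of the answers above.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2],
                   "year_pass": ["Aug 2018 - Nov 2018", "Jul 2017"]})

# Split on the separator; rows without a range get None in column 1.
parts = df["year_pass"].str.split(" - ", expand=True)

# If no row had a range, column 1 would not exist; reindex guarantees it.
parts = parts.reindex(columns=[0, 1])

# Parse "Mon YYYY" strings; missing values become NaT.
df["year_start"] = pd.to_datetime(parts[0].str.strip(), format="%b %Y")
df["year_end"] = pd.to_datetime(parts[1].str.strip(), format="%b %Y")
df = df.drop(columns="year_pass")
print(df)
```

Formatting to DD-MM-YYYY with dt.strftime can then be left to the very end, for display only, so that date arithmetic still works on the stored columns.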
