Python merge multiple date columns with null into 1 column Pandas

I am trying to convert multiple date formats into YYYY-MM-DD, then merge them into one column while ignoring the NULLs, but I end up with TypeError: cannot add DatetimeArray and DatetimeArray
import datetime
import pandas as pd
data = [[ 'Apr 2021'], ['Jan 1'], ['Fri'], [ 'Jan 18']]
df = pd.DataFrame(data, columns = ['date', ])
#convert Month date Jan 1
df['date1']=(pd.to_datetime('2021 '+ df['date'],errors='coerce',format='%Y %b %d'))
# convert Month Year Apr 2021
df['date2']=pd.to_datetime(df['date'], errors='coerce')
#convert fri to this friday
today = datetime.date.today()
friday = today + datetime.timedelta( (4-today.weekday()) % 7 )
this_friday = friday.strftime('%Y-%m-%d')
df['date3']=df['date'].map({'Fri':this_friday})
df['date3'] = pd.to_datetime(df['date3'])
df['dateFinal'] = df['date1'] + df['date2'] + df['date3']
I checked the dtypes and they are all datetime, so I don't know why this fails. My approach is not efficient, so feel free to suggest a better way.

IIUC:
try via bfill() on axis=1:
df['dateFinal'] = df[['date1','date2','date3']].bfill(axis=1).iloc[:,0]
OR
via ffill() on axis=1:
df['dateFinal'] = df[['date1','date2','date3']].ffill(axis=1).iloc[:,-1]
OR
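via stack()+to_numpy() (this only lines up if every row has exactly one non-null date and stack() drops the missing values):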
df['dateFinal'] = df[['date1','date2','date3']].stack().to_numpy()
output of df:
date date1 date2 date3 dateFinal
0 Apr 2021 NaT 2021-04-01 NaT 2021-04-01
1 Jan 1 2021-01-01 NaT NaT 2021-01-01
2 Fri NaT NaT 2021-08-13 2021-08-13
3 Jan 18 2021-01-18 NaT NaT 2021-01-18
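The TypeError arises because adding two datetimes is not a defined operation; what the question really needs is the first non-null value in each row. A minimal sketch of the same idea with Series.combine_first, assuming the three input formats from the question and that strings like 'Jan 1' belong to 2021 (the variable names month_day, month_year and weekday are illustrative, not from the original post):

```python
import datetime

import pandas as pd

df = pd.DataFrame({"date": ["Apr 2021", "Jan 1", "Fri", "Jan 18"]})

# 'Jan 1' style: prepend an assumed year; anything else becomes NaT.
month_day = pd.to_datetime("2021 " + df["date"], format="%Y %b %d", errors="coerce")

# 'Apr 2021' style: an explicit format keeps 'Jan 1' from being guessed at.
month_year = pd.to_datetime(df["date"], format="%b %Y", errors="coerce")

# 'Fri' style: map to the upcoming Friday (today, if today is a Friday).
today = datetime.date.today()
friday = today + datetime.timedelta(days=(4 - today.weekday()) % 7)
weekday = pd.to_datetime(df["date"].map({"Fri": pd.Timestamp(friday)}))

# Take the first non-null value per row instead of adding the columns.
df["dateFinal"] = month_day.combine_first(month_year).combine_first(weekday)
print(df)
```

This avoids the intermediate date1/date2/date3 columns; if they are needed anyway, the bfill(axis=1) approach above does the same job.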

Related

Parse object index with date, time, and time zone

Python Q. How to parse an object index in a data frame into its date, time, and time zone?
The format is "YYYY-MM-DD HH:MM:SS-HH:MM"
where the right "HH:MM" is the timezone.
Example:
Midnight Jan 1st, 2020 in Mountain Time:
2020-01-01 00:00:00-07:00
I'm trying to convert this into seven columns in the data frame:
YYYY, MM, DD, HH, MM, SS, TZ
Use pd.to_datetime to parse a string column into a datetime array
datetimes = pd.to_datetime(column)
Once you have this, you can access the components of the datetimes with the .dt accessor:
final = pd.DataFrame({
"year": datetimes.dt.year,
"month": datetimes.dt.month,
"day": datetimes.dt.day,
"hour": datetimes.dt.hour,
"minute": datetimes.dt.minute,
"second": datetimes.dt.second,
"timezone": datetimes.dt.tz,
})
See the pandas user guide section on date/time functionality for more info
df
Date
0 2022-05-01 01:10:04+07:00
1 2022-05-02 05:09:10+07:00
2 2022-05-02 11:22:05+07:00
3 2022-05-02 10:00:30+07:00
df['Date'] = pd.to_datetime(df['Date'])
df['tz']= df['Date'].dt.tz
df['year']= df['Date'].dt.year
df['month']= df['Date'].dt.month
df['month_n']= df['Date'].dt.month_name()
df['day']= df['Date'].dt.day
df['day_n']= df['Date'].dt.day_name()
df['h']= df['Date'].dt.hour
df['mn']= df['Date'].dt.minute
df['s']= df['Date'].dt.second
df['T']= df['Date'].dt.time
df['D']= df['Date'].dt.date
Date tz year month month_n day day_n h mn s T D
0 2022-05-01 01:10:04+07:00 pytz.FixedOffset(420) 2022 5 May 1 Sunday 1 10 4 01:10:04 2022-05-01
1 2022-05-02 05:09:10+07:00 pytz.FixedOffset(420) 2022 5 May 2 Monday 5 9 10 05:09:10 2022-05-02
2 2022-05-02 11:22:05+07:00 pytz.FixedOffset(420) 2022 5 May 2 Monday 11 22 5 11:22:05 2022-05-02
3 2022-05-02 10:00:30+07:00 pytz.FixedOffset(420) 2022 5 May 2 Monday 10 0 30 10:00:30 2022-05-02
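Note that .dt.tz returns a single timezone object for the whole column rather than one value per row. If a plain per-row offset string is wanted instead, a small sketch using .dt.strftime('%z') could look like this (the sample values and the column name utc_offset are made up for illustration, and all rows are assumed to share the same offset):

```python
import pandas as pd

raw = pd.Series(["2020-01-01 00:00:00-07:00", "2020-06-15 12:30:45-07:00"])
datetimes = pd.to_datetime(raw)

parts = pd.DataFrame({
    "year": datetimes.dt.year,
    "month": datetimes.dt.month,
    "day": datetimes.dt.day,
    "hour": datetimes.dt.hour,
    "minute": datetimes.dt.minute,
    "second": datetimes.dt.second,
    # '%z' renders the UTC offset as a string such as '-0700'
    "utc_offset": datetimes.dt.strftime("%z"),
})
print(parts)
```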

Unable to extract date/year/quarter from Pandas

As per the discussion, date/year/quarter can be extracted in Pandas as below:
df = pd.DataFrame({'date_text': ['Jan 2020', 'May 2020', 'Jun 2020']})
df ['date'] = pd.to_datetime ( df.date_text ).dt.date
df ['year'], df ['month'],df['qtr'] = df ['date'].dt.year, df ['date'].dt.month, df ['date'].dt.quarter
However, this raises an error:
AttributeError: Can only use .dt accessor with datetimelike values
May I know what I did wrong?
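Fix it by removing the first .dt.date, which turns the column into plain Python date objects (object dtype), so the .dt accessor no longer works on it: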
df ['date'] = pd.to_datetime ( df.date_text )
df ['year'], df ['month'], df['qtr'] = df ['date'].dt.year, df ['date'].dt.month, df ['date'].dt.quarter
df
Out[43]:
date_text date year month qtr
0 Jan 2020 2020-01-01 2020 1 1
1 May 2020 2020-05-01 2020 5 2
2 Jun 2020 2020-06-01 2020 6 2
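To see why the original line fails, it can help to compare the dtypes before and after .dt.date. A short sketch, assuming the same sample data (the names parsed and as_objects are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"date_text": ["Jan 2020", "May 2020", "Jun 2020"]})

parsed = pd.to_datetime(df["date_text"], format="%b %Y")
as_objects = parsed.dt.date  # plain datetime.date objects

print(parsed.dtype)      # a datetime64 dtype: .dt works
print(as_objects.dtype)  # object: .dt raises AttributeError

# Keep the datetime64 column and derive the parts from it.
df["date"] = parsed
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["qtr"] = df["date"].dt.quarter
print(df)
```

If a date-only column is still needed for display, it can be created at the very end, after the year/month/quarter columns have been derived.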

Manipulating Date in Pandas

I'm trying to understand various functions in Python as I come from an R background.
The question I face is: How do I add and subtract days/years/months from pandas based on a condition? In R, I can use the dplyr package where mutate and ifelse will allow me to achieve it together with the lubridate package.
Here is my reproducible data in R:
df = data.frame(date1=c("2017-07-07", "2017-02-11", "2017-05-22", "2017-04-27"))
library(lubridate)
df$date1 <- ymd(df$date1) + years(2)
df$day <- wday(df$date1, label=TRUE)
Input
date1 day
1 2019-07-07 Sun
2 2019-02-11 Mon
3 2019-05-22 Wed
4 2019-04-27 Sat
Task: Add a year to the date if the day is "Sun" and subtract a year from the date if day is "Sat", else IGNORE
R Code
library(dplyr)
df %>% mutate(newdate = ifelse(df$day == "Sun", date1 %m+% years(1),
ifelse(df$day == "Sat", date1 %m-% years(1), date1))) -> df
df$newdate <- as.Date(df$newdate, origin = "1970-01-01")
df$newday <- wday(df$newdate, label=T)
df
Output
date1 day newdate newday
1 2019-07-07 Sun 2020-07-07 Tue
2 2019-02-11 Mon 2019-02-11 Mon
3 2019-05-22 Wed 2019-05-22 Wed
4 2019-04-27 Sat 2018-04-27 Fri
Could someone share with me how to achieve this output using Pandas?
Use DateOffset to add years, with Series.dt.strftime and %a for the names of the days:
df = pd.DataFrame({'date1':pd.to_datetime(["2017-07-07",
"2017-02-11",
"2017-05-22",
"2017-04-27"])})
df['date1'] += pd.offsets.DateOffset(years=2)
df['day'] = df['date1'].dt.strftime('%a')
To set values by multiple boolean masks, use numpy.select:
import numpy as np
masks = [df['day'] == 'Sun',
         df['day'] == 'Sat']
vals = [df['date1'] + pd.offsets.DateOffset(years=1),
        df['date1'] - pd.offsets.DateOffset(years=1)]
df['newdate'] = np.select(masks, vals, default=df['date1'])
df['newday'] = df['newdate'].dt.strftime('%a')
print (df)
date1 day newdate newday
0 2019-07-07 Sun 2020-07-07 Tue
1 2019-02-11 Mon 2019-02-11 Mon
2 2019-05-22 Wed 2019-05-22 Wed
3 2019-04-27 Sat 2018-04-27 Fri
This should work fine for you:
df = pd.DataFrame(data = {'date1':["2017-07-07", "2017-02-11", "2017-05-22", "2017-04-27"], 'day':["Sun", "Mon", "Wed", "Sat"]})
df['date1']= pd.to_datetime(df['date1'])
df['date1'] = df['date1'] + pd.DateOffset(years=2)
def func_year(row):
    if row['day'] == 'Sun':
        date = row['date1'] + pd.DateOffset(years=1)
    elif row['day'] == 'Sat':
        date = row['date1'] - pd.DateOffset(years=1)
    else:
        date = row['date1']
    return date
df['new_date'] = df.apply(func_year, axis=1)
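A pandas-only variant of the same conditional logic, sketched with Series.mask so that neither numpy nor a row-wise apply is needed. It assumes the data from the question and an English locale for the %a day names; the names is_sun and is_sat are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"date1": pd.to_datetime(
    ["2017-07-07", "2017-02-11", "2017-05-22", "2017-04-27"])})
df["date1"] = df["date1"] + pd.DateOffset(years=2)
df["day"] = df["date1"].dt.strftime("%a")

is_sun = df["day"] == "Sun"
is_sat = df["day"] == "Sat"

# mask() replaces values where the condition is True and keeps the rest.
df["newdate"] = (
    df["date1"]
    .mask(is_sun, df["date1"] + pd.DateOffset(years=1))
    .mask(is_sat, df["date1"] - pd.DateOffset(years=1))
)
df["newday"] = df["newdate"].dt.strftime("%a")
print(df)
```

The result stays a datetime64 column, so the .dt accessor keeps working on newdate.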

Python Pandas: split and change the date format (one value with a range, e.g. Aug 2018 - Nov 2018, and the other with only one date)?

Split Date e.g. Aug 2018 --> 01-08-2018 ??
Here's my sample input
id year_pass
1 Aug 2018 - Nov 2018
2 Jul 2017
Here's my sample input 2
id year_pass
1 Jul 2018
2 Aug 2017 - Nov 2018
What I did:
I'm able to split the date for the range case, e.g. (Aug 2018 - Nov 2018):
# splitting the date column on the '-'
year_start, year_end = df['year_pass'].str.split('-')
df.drop('year_pass', axis=1, inplace=True)
# assigning the split values to columns
df['year_start'] = year_start
df['year_end'] = year_end
# converting to datetime objects
df['year_start'] = pd.to_datetime(df['year_start'])
df['year_end'] = pd.to_datetime(df['year_end'])
But I couldn't figure out how to handle both cases.
Output should be:
id year_start year_end
1 01-08-2018 01-11-2018
2 01-07-2017
This is one approach using dt.strftime("%d-%m-%Y").
Ex:
import pandas as pd
df = pd.DataFrame({"year_pass": ["Aug 2018 - Nov 2018", "Jul 2017"]})
df[["year_start", 'year_end']] = df["year_pass"].str.split(" - ", expand=True)
df["year_start"] = pd.to_datetime(df['year_start']).dt.strftime("%d-%m-%Y")
df["year_end"] = pd.to_datetime(df['year_end']).dt.strftime("%d-%m-%Y")
df.drop('year_pass', axis=1, inplace=True)
print(df)
Output:
year_start year_end
0 01-08-2018 01-11-2018
1 01-07-2017 NaT
Edit as per comment:
import pandas as pd
def replaceInitialSpace(val):
    if val.startswith(" "):
        return " - "+val.strip()
    return val
df = pd.DataFrame({"year_pass": [" Jul 2018", "Aug 2018 - Nov 2018", "Jul 2017 "]})
df["year_pass"] = df["year_pass"].apply(replaceInitialSpace)
df[["year_start", 'year_end']] = df["year_pass"].str.split(" - ", expand=True)
df["year_start"] = pd.to_datetime(df['year_start']).dt.strftime("%d-%m-%Y")
df["year_end"] = pd.to_datetime(df['year_end']).dt.strftime("%d-%m-%Y")
df.drop('year_pass', axis=1, inplace=True)
print(df)
Output:
year_start year_end
0 NaT 01-07-2018
1 01-08-2018 01-11-2018
2 01-07-2017 NaT
You could start by splitting the strings in the original dataframe:
# split the original dataframe
df = df.year_pass.str.split(' - ', expand=True)
0 1
id
1 Aug2018 Nov2018
2 Jul2017 None
And then apply pd.to_datetime to turn the strings into datetime objects and format them using strftime:
# rename the columns
df.columns = ['year_start','year_end']
df.apply(lambda x: pd.to_datetime(x, errors='coerce').dt.strftime('%d-%m-%Y'), axis=0)
year_start year_end
id
1 01-08-2018 01-11-2018
2 01-07-2017 NaT
If datetimes are needed in the output, the format is necessarily different (YYYY-MM-DD):
df1 = df.pop('year_pass').str.split(r'\s+-\s+', expand=True).apply(pd.to_datetime)
df[['year_start','year_end']] = df1
print (df)
id year_start year_end
0 1 2018-08-01 2018-11-01
1 2 2017-07-01 NaT
print (df.dtypes)
id int64
year_start datetime64[ns]
year_end datetime64[ns]
dtype: object
If you need to change the format, you get strings, so all datetime-like functions will fail on the result:
df1 = (df.pop('year_pass').str.split(r'\s+-\s+', expand=True)
         .apply(lambda x: pd.to_datetime(x).dt.strftime('%d-%m-%Y'))
         .replace('NaT',''))
df[['year_start','year_end']] = df1
print (df)
id year_start year_end
0 1 01-08-2018 01-11-2018
1 2 01-07-2017
print (df.dtypes)
id int64
year_start object
year_end object
dtype: object
print (type(df.loc[0, 'year_start']))
<class 'str'>
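For the case where the single date and the range can appear in any order, a sketch that always yields both columns could look like the following. It assumes a pandas version whose str.split accepts the regex keyword (1.4 or later, as far as I know) and that a lone date is the start; the reindex call is only there to guarantee a second column when no row contains a range:

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2],
                   "year_pass": ["Jul 2018", "Aug 2017 - Nov 2018"]})

# Split on the first dash, tolerating any surrounding whitespace.
parts = (df["year_pass"].str.strip()
         .str.split(r"\s*-\s*", n=1, expand=True, regex=True)
         .reindex(columns=[0, 1]))

start = pd.to_datetime(parts[0], format="%b %Y")
end = pd.to_datetime(parts[1], format="%b %Y")

# Formatting produces strings; missing end dates become empty strings.
df["year_start"] = start.dt.strftime("%d-%m-%Y").fillna("").replace("NaT", "")
df["year_end"] = end.dt.strftime("%d-%m-%Y").fillna("").replace("NaT", "")
df = df.drop(columns="year_pass")
print(df)
```

Keeping start and end as datetime columns and formatting only for display is usually the safer choice if any further date arithmetic is planned.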

pd.to_datetime is getting half my dates with flipped day / months

My dataset has dates in the European format, and I'm struggling to get them parsed correctly by pd.to_datetime: for every date with day <= 12, my month and day are switched.
Is there an easy solution to this?
import pandas as pd
import datetime as dt
df = pd.read_csv(loc,dayfirst=True)
df['Date']=pd.to_datetime(df['Date'])
Is there a way to force datetime to acknowledge that the input is formatted as dd/mm/yy?
Thanks for the help!
Edit, a sample from my dates:
renewal["Date"].head()
Out[235]:
0 31/03/2018
2 30/04/2018
3 28/02/2018
4 30/04/2018
5 31/03/2018
Name: Earliest renewal date, dtype: object
After running the following:
renewal['Date']=pd.to_datetime(renewal['Date'],dayfirst=True)
I get:
Out[241]:
0 2018-03-31 #Correct
2 2018-04-01 #<-- this number is wrong and should be 01-04 instead
3 2018-02-28 #Correct
Add the format argument:
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
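An explicit format removes the guessing altogether, whereas dayfirst=True is documented as a preference rather than a strict rule. A minimal sketch with made-up sample values, including an ambiguous one (01/04/2018):

```python
import pandas as pd

raw = pd.Series(["31/03/2018", "01/04/2018", "28/02/2018"])

# With an explicit format, '01/04/2018' can only mean 1 April 2018.
parsed = pd.to_datetime(raw, format="%d/%m/%Y")
print(parsed)

# A string that does not match the format raises instead of being
# silently reinterpreted; errors='coerce' would turn it into NaT.
bad = pd.to_datetime(pd.Series(["2018-04-01"]), format="%d/%m/%Y", errors="coerce")
print(bad)
```

Also worth checking: in read_csv, dayfirst only has an effect on columns that are actually parsed as dates (for example via parse_dates), so passing it alone leaves the column as strings.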
You can control the date construction directly if you define separate columns for 'year', 'month' and 'day', like this:
import pandas as pd
df = pd.DataFrame(
{'Date': ['01/03/2018', '06/08/2018', '31/03/2018', '30/04/2018']}
)
date_parts = df['Date'].apply(lambda d: pd.Series(int(n) for n in d.split('/')))
date_parts.columns = ['day', 'month', 'year']
df['Date'] = pd.to_datetime(date_parts)
date_parts
# day month year
# 0 1 3 2018
# 1 6 8 2018
# 2 31 3 2018
# 3 30 4 2018
df
# Date
# 0 2018-03-01
# 1 2018-08-06
# 2 2018-03-31
# 3 2018-04-30
