How to conditionally select the first non null date from multiple datetime columns in a pandas dataframe? - python

I have a pandas dataframe with multiple datetime columns. I want to create a new column containing the first date that is not null from the first, second, or third column, in that order. If all three columns are null, the new column should be set to today.
An example of my database is:
date1 date2 date3
0 NaT 2019-01-26 NaT
1 2021-04-13 2021-02-27 NaT
2 NaT NaT NaT
3 NaT NaT NaT
4 NaT NaT NaT
I want to create a new column, date 4, with the first date that is not NaT from date 1 to date 3. The result I expect is:
date1 date2 date3 date4
0 NaT 2019-01-26 NaT 2019-01-26 # (date 2)
1 2021-04-13 2021-02-27 NaT 2021-04-13 # (date 1)
2 NaT NaT NaT 2021-06-04 # (today )
3 NaT NaT NaT 2021-06-04 # (today )
4 NaT NaT 2021-02-20 2021-02-20 # (date 3)
I tried this line:
df["date4"] = df.loc[(df["date1"]) | (df["date2"]) | (df["date3"]) | pd.to_datetime("today")]
but it raises the error TypeError: unsupported operand type(s) for |: 'DatetimeArray' and 'DatetimeArray'

The idea is to back-fill missing values across the selected columns, then select the first column by position and replace any remaining missing values with today:
df['date4'] = (df[['date1','date2','date3']].bfill(axis=1)
.iloc[:, 0]
.fillna(pd.to_datetime("today").normalize()))
print (df)
date1 date2 date3 date4
0 NaT 2019-01-26 NaT 2019-01-26
1 2021-04-13 2021-02-27 NaT 2021-04-13
2 NaT NaT NaT 2021-06-04
3 NaT NaT NaT 2021-06-04
4 NaT NaT NaT 2021-06-04
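An alternative to the back-fill approach above is to chain fillna calls in priority order. This is only a minimal sketch: the small frame below is made-up sample data in the shape of the question, and the variable name today is an assumption for illustration.

```python
import pandas as pd

# Hypothetical sample data shaped like the question's frame.
df = pd.DataFrame({
    "date1": pd.to_datetime([None, "2021-04-13", None]),
    "date2": pd.to_datetime(["2019-01-26", "2021-02-27", None]),
    "date3": pd.to_datetime([None, None, None]),
})

# Midnight of the current day, used only when all three columns are NaT.
today = pd.Timestamp("today").normalize()

# Take date1, fall back to date2, then date3, then today.
df["date4"] = (
    df["date1"]
    .fillna(df["date2"])
    .fillna(df["date3"])
    .fillna(today)
)
print(df)
```

Chaining fillna spells out the priority order explicitly, which may be easier to read for a handful of columns; bfill(axis=1) scales better when there are many.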

Related

How to calculate date difference from two columns but with different rows and a condition?

Based on the example dataframe below, I would like to calculate the difference between two datetimes within each index group, as well as its cumulative sum. The expected results are shown in the columns diff_days and cum_diff_days:
index date_a date_b diff_days cum_diff_days
1 1/1/2023 NaT NaT -
1 NaT NaT NaT -
1 NaT 3/1/2023 2 2
2 4/1/2023 NaT NaT -
2 NaT NaT NaT -
2 NaT 6/1/2023 2 4
3 7/1/2023 NaT NaT -
3 NaT 8/1/2023 1 5
3 9/1/2023 NaT NaT -
3 NaT NaT NaT -
3 NaT 11/1/2023 2 7
I have checked the other post that calculates the difference between two dates, but in that case both dates are in the same row. In my case, I want to understand how to calculate the difference when the dates are in different rows and different columns, since simply subtracting with df['diff_days'] = df['date_a'] - df['date_b'] produces only NaT results. I would really appreciate it if someone could enlighten me on this problem.
Try this out
# python 3.10.6
from io import StringIO
import pandas as pd # 1.5.1
string = """index date_a date_b diff_days cum_diff_days
1 1/1/2023 NaT NaT -
1 NaT NaT NaT -
1 NaT 3/1/2023 2 2
2 4/1/2023 NaT NaT -
2 NaT NaT NaT -
2 NaT 6/1/2023 2 4
3 7/1/2023 NaT NaT -
3 NaT 8/1/2023 1 5
3 9/1/2023 NaT NaT -
3 NaT NaT NaT -
3 NaT 11/1/2023 2 7"""
df = pd.read_csv(StringIO(string), sep=r"\s+")  # the sample above is whitespace-separated
# convert to datetime
df["date_a"] = pd.to_datetime(df.date_a, format="%d/%m/%Y")
df["date_b"] = pd.to_datetime(df.date_b, format="%d/%m/%Y")
# forward-fill `df.date_a` and subtract from `df.date_b`
# then get `.days` attribute to convert to numeric
df["diff_days"] = df.date_b.sub(df.date_a.ffill()).dt.days
# cumulative sum the differences
df["cum_diff_days"] = df.diff_days.cumsum()
# optionally fill the nulls with "-"
df[["diff_days", "cum_diff_days"]] = df[
["diff_days", "cum_diff_days"]
].fillna("-")
print(df)
index date_a date_b diff_days cum_diff_days
0 1 2023-01-01 NaT - -
1 1 NaT NaT - -
2 1 NaT 2023-01-03 2.0 2.0
3 2 2023-01-04 NaT - -
4 2 NaT NaT - -
5 2 NaT 2023-01-06 2.0 4.0
6 3 2023-01-07 NaT - -
7 3 NaT 2023-01-08 1.0 5.0
8 3 2023-01-09 NaT - -
9 3 NaT NaT - -
10 3 NaT 2023-01-11 2.0 7.0
References:
pandas.to_datetime
pandas.Series.ffill
pandas.Series.cumsum
You can use to_datetime, where+bfill to form the grouper, then groupby.agg and join:
# ensure datetime
df[['date_a', 'date_b']] = df[['date_a', 'date_b']].apply(pd.to_datetime, dayfirst=True)
# form grouper based on backfilled date_b
# and use the index as group value
grp = df.index.to_series().where(df['date_b'].notna()).bfill()
# get the first date_a / last date_b (you can also get min/max, first/first…)
# compute the sum and cumsum
# join to original DataFrame
out = df.join(
df.groupby(grp).agg({'date_a': 'first', 'date_b': 'last'})
.assign(diff_days=lambda d: d['date_b'].sub(d['date_a']).dt.days,
cum_diff_days=lambda d: d['diff_days'].cumsum()
)[['diff_days', 'cum_diff_days']]
)
print(out)
Output:
index date_a date_b diff_days cum_diff_days
0.0 1 2023-01-01 NaT NaN NaN
1.0 1 NaT NaT NaN NaN
2.0 1 NaT 2023-01-03 2.0 2.0
3.0 2 2023-01-04 NaT NaN NaN
4.0 2 NaT NaT NaN NaN
5.0 2 NaT 2023-01-06 2.0 4.0
6.0 3 2023-01-07 NaT NaN NaN
7.0 3 NaT 2023-01-08 1.0 5.0
8.0 3 2023-01-09 NaT NaN NaN
9.0 3 NaT NaT NaN NaN
10.0 3 NaT 2023-01-10 1.0 6.0
Proposed script (for testing)
import pandas as pd
df = pd.DataFrame({'date_a': ["1/1/2023", pd.NaT, pd.NaT, "4/1/2023", pd.NaT, pd.NaT,
"7/1/2023", pd.NaT, "9/1/2023", pd.NaT, pd.NaT],
'date_b': [pd.NaT, pd.NaT, "3/1/2023", pd.NaT, pd.NaT, "6/1/2023",
pd.NaT, "8/1/2023", pd.NaT, pd.NaT, "11/1/2023"],
})
r = df.drop_duplicates(keep=False).copy()
r['date_a'] = r['date_a'].shift(1)
r = r.drop_duplicates(keep=False)
r['diff_days'] = (pd.to_datetime(r['date_b'], dayfirst=True)
- pd.to_datetime(r['date_a'], dayfirst=True)).dt.days
r['cum_diff_days'] = r['diff_days'].cumsum()
df = df.join(r[['diff_days', 'cum_diff_days']], how='left')
df['cum_diff_days'] = df['cum_diff_days'].fillna('-') # optional
print(df)
Result
date_a date_b diff_days cum_diff_days
0 1/1/2023 NaT NaN -
1 NaT NaT NaN -
2 NaT 3/1/2023 2.0 2.0
3 4/1/2023 NaT NaN -
4 NaT NaT NaN -
5 NaT 6/1/2023 2.0 4.0
6 7/1/2023 NaT NaN -
7 NaT 8/1/2023 1.0 5.0
8 9/1/2023 NaT NaN -
9 NaT NaT NaN -
10 NaT 11/1/2023 2.0 7.0
Note date_a and date_b keep their original type for further calculation

datetime hour component to column python pandas

I have a dataframe as such
Date Value
2022-01-01 10:00:00 7
2022-01-01 10:30:00 5
2022-01-01 11:00:00 3
....
....
2022-02-15 21:00:00 8
I would like to convert it into a format with one row per day and one column per time of day. The times become the columns, and the Value column fills the cells.
Date 10:00 10:30 11:00 11:30............21:00
2022-01-01 7 5 3 4 11
2022-01-02 8 2 4 4 13
How can I achieve this? I have tried a pivot table, but without success.
Use pivot_table:
df['Date'] = pd.to_datetime(df['Date'])
out = df.pivot_table('Value', df['Date'].dt.date, df['Date'].dt.time, fill_value=0)
print(out)
# Output
Date 10:00:00 10:30:00 11:00:00 21:00:00
Date
2022-01-01 7 5 3 0
2022-02-15 0 0 0 8
To remove the Date labels, you can use rename_axis:
for the top Date label (columns): out.rename_axis(columns=None)
for the bottom Date label (index): out.rename_axis(index=None)
for both: out.rename_axis(index=None, columns=None)
You can replace None with any string to rename the axis instead.
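The pivot above can also be sketched as a self-contained script. The sample rows and the helper column names day and time are assumptions for illustration, and the sketch assumes at most one Value per timestamp (pivot_table averages duplicates by default). Formatting the time with strftime gives HH:MM column labels like those in the question.

```python
import pandas as pd

# Hypothetical sample data in the shape described in the question.
df = pd.DataFrame({
    "Date": [
        "2022-01-01 10:00:00",
        "2022-01-01 10:30:00",
        "2022-01-01 11:00:00",
        "2022-01-02 10:00:00",
    ],
    "Value": [7, 5, 3, 8],
})
df["Date"] = pd.to_datetime(df["Date"])

out = (
    df.assign(
        day=df["Date"].dt.date,                # one row per calendar day
        time=df["Date"].dt.strftime("%H:%M"),  # one column per time of day
    )
    .pivot_table(index="day", columns="time", values="Value", fill_value=0)
    .rename_axis(index="Date", columns=None)   # tidy the axis labels
)
print(out)
```

Cells with no observation for a given day and time are filled with 0 here; drop fill_value to keep them as NaN instead.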

applying np.where to datetime64[ns] changes dtype to object

I have a dataset containing a lot of dates. I only want to keep the dates larger than the date stated in the first column. Otherwise, I would like to replace them with NaT. This is an example of what the original dataset looks like:
reference_date date_1 date_2 ...
0 2017-01-20 2016-02-09 NaT
1 2016-01-05 NaT NaT
2 2016-01-13 2015-07-22 2016-02-29
3 2016-01-13 2016-04-18 2015-05-11
4 2016-01-11 NaT NaT
... ... ... ...
This is the output I would like to have:
date_1 date_2 ...
0 NaT NaT
1 NaT NaT
2 NaT 2016-02-29
3 2016-04-18 NaT
4 NaT NaT
... ... ...
First I tried
df.loc[df['reference_date'] > df['date_1'], 'date_1'] = pd.NaT
which works for one column at a time, but I want to apply this to a lot of columns.
I managed to replace all the unwanted dates with NaT using this code:
cols = df.columns[1:]
result = df[cols].apply(lambda x: np.where(x > df.reference_date, x, pd.NaT), axis = 0)
However, the original dates are transformed to another data type (originally it was datetime64[ns] and now it is object), resulting in large numbers instead of dates:
date_1 date_2 ...
0 NaT NaT
1 NaT NaT
2 NaT 1456704000000000000
3 1460937600000000000 NaT
4 NaT NaT
... ... ...
Any ideas what happens here and how I can keep the original date?
Many thanks
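One possible way to avoid the dtype change is to stay within pandas and use DataFrame.where with a boolean mask instead of np.where, which likely falls back to object dtype because pd.NaT is mixed with a datetime64 array. The following is a minimal sketch, assuming the frame looks like the example in the question; the variable name mask is introduced only for illustration.

```python
import pandas as pd

# Sample data rebuilt from the example in the question.
df = pd.DataFrame({
    "reference_date": pd.to_datetime(
        ["2017-01-20", "2016-01-05", "2016-01-13", "2016-01-13", "2016-01-11"]
    ),
    "date_1": pd.to_datetime(["2016-02-09", None, "2015-07-22", "2016-04-18", None]),
    "date_2": pd.to_datetime([None, None, "2016-02-29", "2015-05-11", None]),
})

cols = df.columns[1:]

# True where a date is strictly later than the row's reference date.
# Comparisons involving NaT evaluate to False.
mask = df[cols].gt(df["reference_date"], axis=0)

# Keep values where the mask is True; everything else becomes NaT.
result = df[cols].where(mask)
print(result)
print(result.dtypes)
```

Because where operates on the datetime columns directly, the result should keep its datetime dtype rather than turning into integers.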

How to add series values to date/datetime object?

I have a pandas dataframe like as shown below
df = pd.DataFrame({'login_date':['5/7/2013 09:27:00 AM','09/08/2013 11:21:00 AM','06/06/2014 08:00:00 AM','06/06/2014 05:00:00 AM','','10/11/1990'],
'DURATION':[21,30,200,34,45,np.nan]})
I would like to add DURATION values to the login_date column
The DURATION is represented in Days type
If there is NA in DURATION column, just replace it with 0.
So, I tried the below
df['DURATION'] = df['DURATION'].fillna(0)
df['login_date'] = pd.to_datetime(df['login_date'])
df['DURATION'] = df['DURATION'].astype('Int64')
df['logout_Date'] = df['login_date'] + pd.offsets.DateOffset(days=df['DURATION'])
However, this results in an error as shown below
TypeError: Invalid type <class 'pandas.core.series.Series'>. Must be int or float.
But I have already converted my DURATION column to int64 type.
How can I add a column of values to my login_date column to get logout_date?
Try:
df["logout_date"] = pd.to_datetime(df["login_date"]) + df["DURATION"].fillna(0).apply(lambda x: pd.Timedelta(days=x))
print(df)
Prints:
login_date DURATION logout_date
0 5/7/2013 09:27:00 AM 21.0 2013-05-28 09:27:00
1 09/08/2013 11:21:00 AM 30.0 2013-10-08 11:21:00
2 06/06/2014 08:00:00 AM 200.0 2014-12-23 08:00:00
3 06/06/2014 05:00:00 AM 34.0 2014-07-10 05:00:00
4 45.0 NaT
5 10/11/1990 NaN 1990-10-11 00:00:00
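A vectorized variant of the same idea converts the whole DURATION column at once with pd.to_timedelta instead of applying pd.Timedelta row by row. This is a minimal sketch with made-up, consistently formatted sample data, not the exact frame from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: one missing duration, one missing login date.
df = pd.DataFrame({
    "login_date": pd.to_datetime(
        ["2013-05-07 09:27:00", "2013-09-08 11:21:00", None]
    ),
    "DURATION": [21, np.nan, 45],
})

# Treat missing durations as 0 days, then convert the column to timedeltas.
duration = pd.to_timedelta(df["DURATION"].fillna(0), unit="D")

# Datetime column + timedelta column is computed element-wise.
df["logout_date"] = df["login_date"] + duration
print(df)
```

A missing login_date still yields NaT in logout_date, matching the behaviour shown in the output above.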

pandas error creating TimeDeltas from Datetime operation

I have looked at several other related questions here, here, and here, and none of them have come across quite the same problem as me.
I am using Pandas version 0.16.2. I have several columns in a Pandas dataframe, of dtype datetime64[ns]:
In [6]: date_list = ["SubmittedDate","PolicyStartDate", "PaidUpDate", "MaturityDate", "DraftDate", "CurrentValuationDate", "DOB", "InForceDate"]
In [11]: data[date_list].head()
Out[11]:
SubmittedDate PolicyStartDate PaidUpDate MaturityDate DraftDate \
0 NaT 2002-11-18 NaT 2041-03-04 NaT
1 NaT 2015-01-13 NaT NaT NaT
2 NaT 2014-10-15 NaT NaT NaT
3 NaT 2009-08-27 NaT NaT NaT
4 NaT 2007-04-19 NaT 2013-10-01 NaT
CurrentValuationDate DOB InForceDate
0 2015-04-30 1976-03-04 2002-11-18
1 NaT 1949-09-27 2015-01-13
2 NaT 1947-06-15 2014-10-15
3 2015-07-30 1960-06-07 2009-08-27
4 2010-04-21 1950-10-01 2007-04-19
These were originally in string format (e.g. '1976-03-04') which I converted to datetime objects using:
In [7]: for datecol in date_list:
...: data[datecol] = pd.to_datetime(data[datecol], coerce=True, errors = 'raise')
Here are the dtypes for each of these columns:
In [8]: for datecol in date_list:
print data[datecol].dtypes
returns:
datetime64[ns]
datetime64[ns]
datetime64[ns]
datetime64[ns]
datetime64[ns]
datetime64[ns]
datetime64[ns]
datetime64[ns]
So far, so good. But what I want to do is create a new column for each of these columns that gives the age in days (as an integer) from a certain date.
In [13]: current_date = pd.to_datetime("2015-07-31")
I first ran this:
In [14]: for i in date_list:
....: data[i+"InDays"] = data[i].apply(lambda x: current_date - x)
However, when I check the dtype of the returned columns:
In [15]: for datecol in date_list:
....: print data[datecol + "InDays"].dtypes
I get these:
object
timedelta64[ns]
object
timedelta64[ns]
object
timedelta64[ns]
timedelta64[ns]
timedelta64[ns]
I don't know why three of them are objects, when they should be timedeltas. What I want to do next is:
In [16]: for i in date_list:
....: data[i+"InDays"] = data[i+"InDays"].dt.days
This approach works fine for the timedelta columns. However, since three of the columns are not timedeltas, I get this error:
AttributeError: Can only use .dt accessor with datetimelike values
I suspect that there are some values in those three columns that are preventing Pandas from converting them to timedeltas. I can't figure out how to work out what those values might be.
The issue occurs because three of your columns contain only NaT values, which causes the result of apply on those columns to be treated as object dtype rather than timedelta64[ns].
You should put some kind of condition in your apply part, to default to some timedelta in case of NaT. Example -
for i in date_list:
data[i+"InDays"] = data[i].apply(lambda x: current_date - x if x is not pd.NaT else pd.Timedelta(0))
Or, if you cannot do the above, add a condition around data[i+"InDays"] = data[i+"InDays"].dt.days so that it runs only when the dtype of the series allows it.
Or a simpler way to change the apply part to directly get what you want would be -
for i in date_list:
data[i+"InDays"] = data[i].apply(lambda x: (current_date - x).days if x is not pd.NaT else x)
This would output -
In [110]: data
Out[110]:
SubmittedDate PolicyStartDate PaidUpDate MaturityDate DraftDate \
0 NaT 2002-11-18 NaT 2041-03-04 NaT
1 NaT 2015-01-13 NaT NaT NaT
2 NaT 2014-10-15 NaT NaT NaT
3 NaT 2009-08-27 NaT NaT NaT
4 NaT 2007-04-19 NaT 2013-10-01 NaT
CurrentValuationDate DOB InForceDate SubmittedDateInDays \
0 2015-04-30 1976-03-04 2002-11-18 NaT
1 NaT 1949-09-27 2015-01-13 NaT
2 NaT 1947-06-15 2014-10-15 NaT
3 2015-07-30 1960-06-07 2009-08-27 NaT
4 2010-04-21 1950-10-01 2007-04-19 NaT
PolicyStartDateInDays PaidUpDateInDays MaturityDateInDays DraftDateInDays \
0 4638 NaT -9348 NaT
1 199 NaT NaN NaT
2 289 NaT NaN NaT
3 2164 NaT NaN NaT
4 3025 NaT 668 NaT
CurrentValuationDateInDays DOBInDays InForceDateInDays
0 92 14393 4638
1 NaN 24048 199
2 NaN 24883 289
3 1 20142 2164
4 1927 23679 3025
If you want your NaT to be changed to NaN you can use -
for i in date_list:
data[i+"InDays"] = data[i].apply(lambda x: (current_date - x).days if x is not pd.NaT else np.NaN)
Example/Demo -
In [114]: for i in date_list:
.....: data[i+"InDays"] = data[i].apply(lambda x: (current_date - x).days if x is not pd.NaT else np.NaN)
.....:
In [115]: data
Out[115]:
SubmittedDate PolicyStartDate PaidUpDate MaturityDate DraftDate \
0 NaT 2002-11-18 NaT 2041-03-04 NaT
1 NaT 2015-01-13 NaT NaT NaT
2 NaT 2014-10-15 NaT NaT NaT
3 NaT 2009-08-27 NaT NaT NaT
4 NaT 2007-04-19 NaT 2013-10-01 NaT
CurrentValuationDate DOB InForceDate SubmittedDateInDays \
0 2015-04-30 1976-03-04 2002-11-18 NaN
1 NaT 1949-09-27 2015-01-13 NaN
2 NaT 1947-06-15 2014-10-15 NaN
3 2015-07-30 1960-06-07 2009-08-27 NaN
4 2010-04-21 1950-10-01 2007-04-19 NaN
PolicyStartDateInDays PaidUpDateInDays MaturityDateInDays \
0 4638 NaN -9348
1 199 NaN NaN
2 289 NaN NaN
3 2164 NaN NaN
4 3025 NaN 668
DraftDateInDays CurrentValuationDateInDays DOBInDays InForceDateInDays
0 NaN 92 14393 4638
1 NaN NaN 24048 199
2 NaN NaN 24883 289
3 NaN 1 20142 2164
4 NaN 1927 23679 3025
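In more recent pandas versions than the 0.16.2 used in the question, the apply call can usually be skipped altogether: subtracting the column from the timestamp directly gives a timedelta column even when every value is NaT, so .dt.days works on all columns. This is a minimal sketch on a reduced, partly made-up subset of the columns above, and it has not been checked against pandas 0.16.2.

```python
import pandas as pd

# Reduced sample: one all-NaT column, one full column, one partial column.
data = pd.DataFrame({
    "SubmittedDate": pd.to_datetime([None, None, None]),
    "PolicyStartDate": pd.to_datetime(["2002-11-18", "2015-01-13", "2014-10-15"]),
    "MaturityDate": pd.to_datetime(["2041-03-04", None, None]),
})
date_list = list(data.columns)
current_date = pd.Timestamp("2015-07-31")

for col in date_list:
    # Vectorized subtraction keeps a timedelta dtype, so .dt.days is valid
    # even for columns that contain only NaT (the result is then all NaN).
    data[col + "InDays"] = (current_date - data[col]).dt.days

print(data)
```

Columns containing missing dates come back as floats with NaN, since an integer column cannot hold missing values by default.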
