How to calculate date difference from two columns but with different rows and a condition? - python

Based on the example dataframe below, I would like to calculate the difference between two datetimes for each index and its cumulative sum. The expected results are shown in the diff_days and cum_diff_days columns:
index  date_a    date_b     diff_days  cum_diff_days
1      1/1/2023  NaT        NaT        -
1      NaT       NaT        NaT        -
1      NaT       3/1/2023   2          2
2      4/1/2023  NaT        NaT        -
2      NaT       NaT        NaT        -
2      NaT       6/1/2023   2          4
3      7/1/2023  NaT        NaT        -
3      NaT       8/1/2023   1          5
3      9/1/2023  NaT        NaT        -
3      NaT       NaT        NaT        -
3      NaT       11/1/2023  2          7
I have checked another post that calculates the difference between two dates, but unfortunately there the dates are in the same row. In my case I wanted to understand how to calculate the difference when the dates are on different rows and in different columns, since simply subtracting them with df['diff_days'] = df['date_a'] - df['date_b'] produces NaT results. I would really appreciate it if someone could enlighten me on this problem.
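The NaT result described above can be reproduced with a minimal two-row sketch (data abbreviated from the example): row-wise subtraction always pairs a real date with a NaT on the same row.

```python
import pandas as pd

# two rows where date_a and date_b are filled on different rows
df = pd.DataFrame({
    "date_a": pd.to_datetime(["2023-01-01", None]),
    "date_b": pd.to_datetime([None, "2023-01-03"]),
})

# every row pairs a date with NaT, so the whole result is NaT
print(df["date_a"] - df["date_b"])
```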

Try this out
# python 3.10.6
from io import StringIO
import pandas as pd # 1.5.1
string = """index date_a date_b diff_days cum_diff_days
1 1/1/2023 NaT NaT -
1 NaT NaT NaT -
1 NaT 3/1/2023 2 2
2 4/1/2023 NaT NaT -
2 NaT NaT NaT -
2 NaT 6/1/2023 2 4
3 7/1/2023 NaT NaT -
3 NaT 8/1/2023 1 5
3 9/1/2023 NaT NaT -
3 NaT NaT NaT -
3 NaT 11/1/2023 2 7"""
df = pd.read_csv(StringIO(string), sep=r"\s+")
# convert to datetime
df["date_a"] = pd.to_datetime(df.date_a, format="%d/%m/%Y")
df["date_b"] = pd.to_datetime(df.date_b, format="%d/%m/%Y")
# forward-fill `df.date_a` and subtract from `df.date_b`
# then get `.days` attribute to convert to numeric
df["diff_days"] = df.date_b.sub(df.date_a.ffill()).dt.days
# cumulative sum the differences
df["cum_diff_days"] = df.diff_days.cumsum()
# optionally fill the nulls with "-"
df[["diff_days", "cum_diff_days"]] = df[
    ["diff_days", "cum_diff_days"]
].fillna("-")
print(df)
index date_a date_b diff_days cum_diff_days
0 1 2023-01-01 NaT - -
1 1 NaT NaT - -
2 1 NaT 2023-01-03 2.0 2.0
3 2 2023-01-04 NaT - -
4 2 NaT NaT - -
5 2 NaT 2023-01-06 2.0 4.0
6 3 2023-01-07 NaT - -
7 3 NaT 2023-01-08 1.0 5.0
8 3 2023-01-09 NaT - -
9 3 NaT NaT - -
10 3 NaT 2023-01-11 2.0 7.0
References:
pandas.to_datetime
pandas.Series.ffill
pandas.Series.cumsum

You can use to_datetime, where+bfill to form the grouper, then groupby.agg and join:
# ensure datetime
df[['date_a', 'date_b']] = df[['date_a', 'date_b']].apply(pd.to_datetime, dayfirst=True)
# form grouper based on backfilled date_b
# and use the index as group value
grp = df.index.to_series().where(df['date_b'].notna()).bfill()
# get the first date_a / last date_b (you can also get min/max, first/first…)
# compute the sum and cumsum
# join to original DataFrame
out = df.join(
    df.groupby(grp).agg({'date_a': 'first', 'date_b': 'last'})
      .assign(diff_days=lambda d: d['date_b'].sub(d['date_a']).dt.days,
              cum_diff_days=lambda d: d['diff_days'].cumsum()
              )[['diff_days', 'cum_diff_days']]
)
print(out)
Output:
index date_a date_b diff_days cum_diff_days
0.0 1 2023-01-01 NaT NaN NaN
1.0 1 NaT NaT NaN NaN
2.0 1 NaT 2023-01-03 2.0 2.0
3.0 2 2023-01-04 NaT NaN NaN
4.0 2 NaT NaT NaN NaN
5.0 2 NaT 2023-01-06 2.0 4.0
6.0 3 2023-01-07 NaT NaN NaN
7.0 3 NaT 2023-01-08 1.0 5.0
8.0 3 2023-01-09 NaT NaN NaN
9.0 3 NaT NaT NaN NaN
10.0 3 NaT 2023-01-11 2.0 7.0

Proposed script (for testing)
import pandas as pd
df = pd.DataFrame({'date_a': ["1/1/2023", pd.NaT, pd.NaT, "4/1/2023", pd.NaT, pd.NaT,
                              "7/1/2023", pd.NaT, "9/1/2023", pd.NaT, pd.NaT],
                   'date_b': [pd.NaT, pd.NaT, "3/1/2023", pd.NaT, pd.NaT, "6/1/2023",
                              pd.NaT, "8/1/2023", pd.NaT, pd.NaT, "11/1/2023"],
                   })
r = df.drop_duplicates(keep=False).copy()
r['date_a'] = r['date_a'].shift(1)
r = r.drop_duplicates(keep=False)
r['diff_days'] = (pd.to_datetime(r['date_b'], dayfirst=True)
                  - pd.to_datetime(r['date_a'], dayfirst=True)).dt.days
r['cum_diff_days'] = r['diff_days'].cumsum()
df = df.join(r[['diff_days', 'cum_diff_days']], how='left')
df['cum_diff_days'] = df['cum_diff_days'].fillna('-') # optional
print(df)
Result
date_a date_b diff_days cum_diff_days
0 1/1/2023 NaT NaN -
1 NaT NaT NaN -
2 NaT 3/1/2023 2.0 2.0
3 4/1/2023 NaT NaN -
4 NaT NaT NaN -
5 NaT 6/1/2023 2.0 4.0
6 7/1/2023 NaT NaN -
7 NaT 8/1/2023 1.0 5.0
8 9/1/2023 NaT NaN -
9 NaT NaT NaN -
10 NaT 11/1/2023 2.0 7.0
Note: date_a and date_b keep their original type for further calculation.

Related

Pandas: Add and Preserve time component when input file has only date in dataframe

Scenario:
The input file which I read in Pandas has column with sparsely populated date in String/Object format.
I need to add time component, for ex.
2021-08-27 is my input in String format, and 2021-08-27 00:00:00 should be the output in datetime64[ns] format
My Trials:
df = pd.read_parquet("sample.parquet")
df.head()
a  b  c  dttime_col
1  1  2  2021-07-12 00:00:00
0  1  0  NaN
1  2  0  NaN
2  1  1  2021-02-04 00:00:00
3  5  2  NaN
df["dttime_col"] = pd.to_datetime(df["dttime_col"])
df["dttime_col"]
Out[16]:
0 2021-07-12
1 NaT
2 NaT
3 2021-02-04
4 NaT
5 2021-05-22
6 NaT
7 2021-10-06
8 2021-01-31
9 NaT
Name: dttime_col, dtype: datetime64[ns]
But as you can see here, there is no time component. I tried adding the format %Y-%m-%d %H:%M:%S but the output is still the same. Furthermore, I tried adding the time component as a default value in String type.
df["dttime_col"] = df["dttime_col"].dt.strftime("%Y-%m-%d 00:00:00").replace('NaT', np.nan)
Out[17]:
0 2021-07-12 00:00:00
1 NaN
2 NaN
3 2021-02-04 00:00:00
4 NaN
5 2021-05-22 00:00:00
6 NaN
7 2021-10-06 00:00:00
8 2021-01-31 00:00:00
9 NaN
Name: dttime_col, dtype: object
Now this gives me time next to date, but in String/Object format. The moment I convert it back to datetime format, all the HH:MM:SS are removed.
df["dttime_col"].apply(lambda x: datetime.datetime.strptime(x, "%Y-%m-%d %H:%M:%S") if not isinstance(x, float) else np.nan)
Out[24]:
0 2021-07-12
1 NaT
2 NaT
3 2021-02-04
4 NaT
5 2021-05-22
6 NaT
7 2021-10-06
8 2021-01-31
9 NaT
Name: dttime_col, dtype: datetime64[ns]
It feels like going in circles all over again.
Output I expect:
0 2021-07-12 00:00:00
1 NaN
2 NaN
3 2021-02-04 00:00:00
4 NaN
5 2021-05-22 00:00:00
6 NaN
7 2021-10-06 00:00:00
8 2021-01-31 00:00:00
9 NaN
Name: dttime_col, dtype: datetime64[ns]
EDIT 1:
Providing output as asked by @mozway
df["dttime_col"].dt.second
Out[27]:
0 0.0
1 NaN
2 NaN
3 0.0
4 NaN
5 0.0
6 NaN
7 0.0
8 0.0
9 NaN
Name: dttime_col, dtype: float64
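The dt.second output above already points at the resolution: a datetime64[ns] value always carries a time component, and pandas merely omits 00:00:00 when printing a column whose times are all midnight. A small sketch (the data here is illustrative, not the actual parquet file):

```python
import pandas as pd

# sparse date strings, loosely mirroring dttime_col
s = pd.to_datetime(pd.Series(["2021-07-12", None, "2021-02-04"]))

# datetime64[ns] always stores a full timestamp; the column repr just
# hides midnight when every time is 00:00:00
print(s.dtype)  # datetime64[ns]
print(s[0])     # 2021-07-12 00:00:00

# if the time must be visible in an export, format to strings at the boundary
formatted = s.dt.strftime("%Y-%m-%d %H:%M:%S")
print(formatted[0])  # 2021-07-12 00:00:00
```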

How to conditionally select the first non null date from multiple datetime columns in a pandas dataframe?

I have a pandas dataframe with multiple datetime columns. I want to create a new column selecting the first date that is not null from the first, second or third column, in that order. And if there is no date in any of these 3 columns, then set it to today.
An example of my database is:
date1 date2 date3
0 NaT 2019-01-26 NaT
1 2021-04-13 2021-02-27 NaT
2 NaT NaT NaT
3 NaT NaT NaT
4 NaT NaT 2021-02-20
I want to create a new column, date 4, with the first date that is not NaT from date 1 to date 3. The result I expect is:
date1 date2 date3 date4
0 NaT 2019-01-26 NaT 2019-01-26 # (date 2)
1 2021-04-13 2021-02-27 NaT 2021-04-13 # (date 1)
2 NaT NaT NaT 2021-06-04 # (today )
3 NaT NaT NaT 2021-06-04 # (today )
4 NaT NaT 2021-02-20 2021-02-20 # (date 3)
I tried this line:
df["date4"] = df.loc[(df["date1"]) | (df["date2"]) | (df["date3"]) | pd.to_datetime("today")]
but it raises the error TypeError: unsupported operand type(s) for |: 'DatetimeArray' and 'DatetimeArray'
The idea is to back-fill missing values across the selected columns, then select the first column by position and replace the remaining missing values with today:
df['date4'] = (df[['date1', 'date2', 'date3']].bfill(axis=1)
               .iloc[:, 0]
               .fillna(pd.to_datetime("today").normalize()))
print (df)
date1 date2 date3 date4
0 NaT 2019-01-26 NaT 2019-01-26
1 2021-04-13 2021-02-27 NaT 2021-04-13
2 NaT NaT NaT 2021-06-04
3 NaT NaT NaT 2021-06-04
4 NaT NaT 2021-02-20 2021-02-20
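For reference, the same approach as a self-contained script; the frame is rebuilt from the example (row 4's date3 comes from the expected-output table), and "today" is pinned to 2021-06-04 so the run is reproducible:

```python
import pandas as pd

df = pd.DataFrame({
    "date1": pd.to_datetime([None, "2021-04-13", None, None, None]),
    "date2": pd.to_datetime(["2019-01-26", "2021-02-27", None, None, None]),
    "date3": pd.to_datetime([None, None, None, None, "2021-02-20"]),
})

# stand-in for pd.to_datetime("today").normalize(), pinned for reproducibility
today = pd.Timestamp("2021-06-04")

# back-fill across the columns, keep the first one, default to today
df["date4"] = df[["date1", "date2", "date3"]].bfill(axis=1).iloc[:, 0].fillna(today)
print(df)
```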

Pandas — match last identical row and compute difference

With a DataFrame like the following:
timestamp value
0 2012-01-01 3.0
1 2012-01-05 3.0
2 2012-01-06 6.0
3 2012-01-09 3.0
4 2012-01-31 1.0
5 2012-02-09 3.0
6 2012-02-11 1.0
7 2012-02-13 3.0
8 2012-02-15 2.0
9 2012-02-18 5.0
What would be an elegant and efficient way to add a time_since_last_identical column, so that the previous example would result in:
timestamp value time_since_last_identical
0 2012-01-01 3.0 NaT
1 2012-01-05 3.0 5 days
2 2012-01-06 6.0 NaT
3 2012-01-09 3.0 4 days
4 2012-01-31 1.0 NaT
5 2012-02-09 3.0 31 days
6 2012-02-11 1.0 10 days
7 2012-02-13 3.0 4 days
8 2012-02-15 2.0 NaT
9 2012-02-18 5.0 NaT
The important part of the problem is not necessarily the usage of time delays. Any solution that matches one particular row with the previous row of identical value, and computes something out of those two rows (here, a difference) will be valid.
Note: not interested in apply or loop-based approaches.
A simple, clean and elegant groupby will do the trick:
df['time_since_last_identical'] = df.groupby('value')['timestamp'].diff()
Gives:
timestamp value time_since_last_identical
0 2012-01-01 3.0 NaT
1 2012-01-05 3.0 4 days
2 2012-01-06 6.0 NaT
3 2012-01-09 3.0 4 days
4 2012-01-31 1.0 NaT
5 2012-02-09 3.0 31 days
6 2012-02-11 1.0 11 days
7 2012-02-13 3.0 4 days
8 2012-02-15 2.0 NaT
9 2012-02-18 5.0 NaT
Here is a solution using pandas groupby:
out = df.groupby(df['value'])\
        .apply(lambda x: pd.to_datetime(x['timestamp'], format="%Y-%m-%d").diff())\
        .reset_index(level=0, drop=False)\
        .reindex(df.index)\
        .rename(columns={'timestamp': 'time_since_last_identical'})
out = pd.concat([df['timestamp'], out], axis=1)
That gives the following output:
timestamp value time_since_last_identical
0 2012-01-01 3.0 NaT
1 2012-01-05 3.0 4 days
2 2012-01-06 6.0 NaT
3 2012-01-09 3.0 4 days
4 2012-01-31 1.0 NaT
5 2012-02-09 3.0 31 days
6 2012-02-11 1.0 11 days
7 2012-02-13 3.0 4 days
8 2012-02-15 2.0 NaT
9 2012-02-18 5.0 NaT
It does not exactly match your desired output, but I guess it is a matter of conventions (e.g. whether to include current day or not). Happy to refine if you provide more details.
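For completeness, the groupby idea from the first answer as a runnable script (the frame is rebuilt from the question's data); selecting the timestamp column before diff keeps the assignment unambiguous:

```python
import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2012-01-01", "2012-01-05", "2012-01-06", "2012-01-09", "2012-01-31",
        "2012-02-09", "2012-02-11", "2012-02-13", "2012-02-15", "2012-02-18",
    ]),
    "value": [3.0, 3.0, 6.0, 3.0, 1.0, 3.0, 1.0, 3.0, 2.0, 5.0],
})

# within each value group, diff() gives the gap to the previous identical row;
# the first occurrence in each group has no predecessor and stays NaT
df["time_since_last_identical"] = df.groupby("value")["timestamp"].diff()
print(df)
```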

How to make all non-date values null in Pandas

I have an excel doc where the users put dates and strings in the same column. I want to make every string object null and leave all the dates. How do I do this in pandas? Thanks.
An easy way to convert dates in a DataFrame is with pandas.DataFrame.convert_objects, as mentioned by @Jeff, and it also handles numbers and timedeltas. Here is an example of using it:
# contents of Sheet1 of test.xlsx
x y date1 z date2 date3
1 fum 6/1/2016 7 9/1/2015 string3
2 fo 6/2/2016 alpha string0 10/1/2016
3 fi 6/3/2016 9 9/3/2015 10/2/2016
4 fee 6/4/2016 10 string1 string4
5 dumbledum 6/5/2016 beta string2 10/3/2015
6 dumbledee 6/6/2016 12 9/4/2015 string5
import pandas as pd
xl = pd.ExcelFile('test.xlsx')
df = xl.parse("Sheet1")
df1 = df.convert_objects(convert_dates='coerce')
# 'coerce' required for conversion to NaT on error
df1
Out[7]:
x y date1 z date2 date3
0 1 fum 2016-06-01 7 2015-09-01 NaT
1 2 fo 2016-06-02 alpha NaT 2016-10-01
2 3 fi 2016-06-03 9 2015-09-03 2016-10-02
3 4 fee 2016-06-04 10 NaT NaT
4 5 dumbledum 2016-06-05 beta NaT 2015-10-03
5 6 dumbledee 2016-06-06 12 2015-09-04 NaT
Individual columns in a DataFrame can be converted using pandas.to_datetime, as pointed out by @Jeff, and with pandas.Series.map; however, neither is done in place. For example, with pandas.to_datetime:
import pandas as pd
xl2 = pd.ExcelFile('test.xlsx')
df2 = xl2.parse("Sheet1")
for col in ['date1', 'date2', 'date3']:
    df2[col] = pd.to_datetime(df2[col], coerce=True, infer_datetime_format=True)
df2
Out[8]:
x y date1 z date2 date3
0 1 fum 2016-06-01 7 2015-09-01 NaT
1 2 fo 2016-06-02 alpha NaT 2016-10-01
2 3 fi 2016-06-03 9 2015-09-03 2016-10-02
3 4 fee 2016-06-04 10 NaT NaT
4 5 dumbledum 2016-06-05 beta NaT 2015-10-03
5 6 dumbledee 2016-06-06 12 2015-09-04 NaT
And using pandas.Series.map:
import pandas as pd
import datetime
xl3 = pd.ExcelFile('test.xlsx')
df3 = xl3.parse("Sheet1")
for col in ['date1', 'date2', 'date3']:
    df3[col] = df3[col].map(lambda x: x if isinstance(x, datetime.datetime) else None)
df3
Out[9]:
x y date1 z date2 date3
0 1 fum 2016-06-01 7 2015-09-01 NaT
1 2 fo 2016-06-02 alpha NaT 2016-10-01
2 3 fi 2016-06-03 9 2015-09-03 2016-10-02
3 4 fee 2016-06-04 10 NaT NaT
4 5 dumbledum 2016-06-05 beta NaT 2015-10-03
5 6 dumbledee 2016-06-06 12 2015-09-04 NaT
An upfront way to convert dates in an excel doc is while parsing its sheets. This can be done with pandas.ExcelFile.parse's converters option, using a function derived from pandas.to_datetime as the values of the converters dict, with coerce=True to force errors to NaT. For example:
def converter(x):
    # the following also works for this example:
    # return pd.to_datetime(x, format='%d/%m/%Y', coerce=True)
    return pd.to_datetime(x, coerce=True, infer_datetime_format=True)

converters = {'date1': converter, 'date2': converter, 'date3': converter}
xl4 = pd.ExcelFile('test.xlsx')
df4 = xl4.parse("Sheet1",converters=converters)
df4
Out[10]:
x y date1 z date2 date3
0 1 fum 2016-06-01 7 2015-09-01 NaT
1 2 fo 2016-06-02 alpha NaT 2016-10-01
2 3 fi 2016-06-03 9 2015-09-03 2016-10-02
3 4 fee 2016-06-04 10 NaT NaT
4 5 dumbledum 2016-06-05 beta NaT 2015-10-03
5 6 dumbledee 2016-06-06 12 2015-09-04 NaT
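A caveat for readers on current pandas: convert_objects was deprecated in 0.17 and later removed, and to_datetime's coerce=True spelling became errors='coerce'. A minimal modern equivalent of the "strings become NaT, dates survive" conversion (column name and data here are illustrative):

```python
import pandas as pd

# a mixed column like the spreadsheet's date2
df = pd.DataFrame({"date2": ["9/1/2015", "string0", "9/3/2015"]})

# errors="coerce" turns anything that fails to parse into NaT
df["date2"] = pd.to_datetime(df["date2"], format="%m/%d/%Y", errors="coerce")
print(df["date2"])
```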

pandas error creating TimeDeltas from Datetime operation

I have looked at several other related questions here, here, and here, and none of them have come across quite the same problem as me.
I am using Pandas version 0.16.2. I have several columns in a Pandas dataframe, of dtype datetime64[ns]:
In [6]: date_list = ["SubmittedDate","PolicyStartDate", "PaidUpDate", "MaturityDate", "DraftDate", "CurrentValuationDate", "DOB", "InForceDate"]
In [11]: data[date_list].head()
Out[11]:
SubmittedDate PolicyStartDate PaidUpDate MaturityDate DraftDate \
0 NaT 2002-11-18 NaT 2041-03-04 NaT
1 NaT 2015-01-13 NaT NaT NaT
2 NaT 2014-10-15 NaT NaT NaT
3 NaT 2009-08-27 NaT NaT NaT
4 NaT 2007-04-19 NaT 2013-10-01 NaT
CurrentValuationDate DOB InForceDate
0 2015-04-30 1976-03-04 2002-11-18
1 NaT 1949-09-27 2015-01-13
2 NaT 1947-06-15 2014-10-15
3 2015-07-30 1960-06-07 2009-08-27
4 2010-04-21 1950-10-01 2007-04-19
These were originally in string format (e.g. '1976-03-04') which I converted to datetime objects using:
In [7]: for datecol in date_list:
   ...:     data[datecol] = pd.to_datetime(data[datecol], coerce=True, errors='raise')
Here are the dtypes for each of these columns:
In [8]: for datecol in date_list:
   ...:     print data[datecol].dtypes
returns:
datetime64[ns]
datetime64[ns]
datetime64[ns]
datetime64[ns]
datetime64[ns]
datetime64[ns]
datetime64[ns]
datetime64[ns]
So far, so good. But what I want to do is create a new column for each of these columns that gives the age in days (as an integer) from a certain date.
In [13]: current_date = pd.to_datetime("2015-07-31")
I first ran this:
In [14]: for i in date_list:
   ....:     data[i+"InDays"] = data[i].apply(lambda x: current_date - x)
However, when I check the dtype of the returned columns:
In [15]: for datecol in date_list:
   ....:     print data[datecol + "InDays"].dtypes
I get these:
object
timedelta64[ns]
object
timedelta64[ns]
object
timedelta64[ns]
timedelta64[ns]
timedelta64[ns]
I don't know why three of them are objects, when they should be timedeltas. What I want to do next is:
In [16]: for i in date_list:
   ....:     data[i+"InDays"] = data[i+"InDays"].dt.days
This approach works fine for the timedelta columns. However, since three of the columns are not timedeltas, I get this error:
AttributeError: Can only use .dt accessor with datetimelike values
I suspect that there are some values in those three columns that are preventing Pandas from converting them to timedeltas. I can't figure out how to work out what those values might be.
The issue occurs because you have three columns containing only NaT values, which causes those columns to be treated as object dtype when you run your apply over them.
You should put some kind of condition in your apply part, to default to some timedelta in case of NaT. Example -
for i in date_list:
    data[i+"InDays"] = data[i].apply(lambda x: current_date - x if x is not pd.NaT else pd.Timedelta(0))
Or, if you cannot do the above, guard the data[i+"InDays"] = data[i+"InDays"].dt.days step with a condition so it runs only when the dtype of the series allows it.
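That dtype guard could be sketched as follows (the column names and values here are made up for illustration):

```python
import pandas as pd

data = pd.DataFrame({
    "PolicyStartDateInDays": pd.to_timedelta(["4638 days", "199 days"]),
    # an all-NaT column stored as object, like the problematic ones above
    "DraftDateInDays": pd.Series([pd.NaT, pd.NaT], dtype=object),
})

for col in ["PolicyStartDateInDays", "DraftDateInDays"]:
    # only use the .dt accessor where the dtype actually supports it
    if pd.api.types.is_timedelta64_dtype(data[col]):
        data[col] = data[col].dt.days
print(data)
```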
Or a simpler way to change the apply part to directly get what you want would be -
for i in date_list:
    data[i+"InDays"] = data[i].apply(lambda x: (current_date - x).days if x is not pd.NaT else x)
This would output -
In [110]: data
Out[110]:
SubmittedDate PolicyStartDate PaidUpDate MaturityDate DraftDate \
0 NaT 2002-11-18 NaT 2041-03-04 NaT
1 NaT 2015-01-13 NaT NaT NaT
2 NaT 2014-10-15 NaT NaT NaT
3 NaT 2009-08-27 NaT NaT NaT
4 NaT 2007-04-19 NaT 2013-10-01 NaT
CurrentValuationDate DOB InForceDate SubmittedDateInDays \
0 2015-04-30 1976-03-04 2002-11-18 NaT
1 NaT 1949-09-27 2015-01-13 NaT
2 NaT 1947-06-15 2014-10-15 NaT
3 2015-07-30 1960-06-07 2009-08-27 NaT
4 2010-04-21 1950-10-01 2007-04-19 NaT
PolicyStartDateInDays PaidUpDateInDays MaturityDateInDays DraftDateInDays \
0 4638 NaT -9348 NaT
1 199 NaT NaN NaT
2 289 NaT NaN NaT
3 2164 NaT NaN NaT
4 3025 NaT 668 NaT
CurrentValuationDateInDays DOBInDays InForceDateInDays
0 92 14393 4638
1 NaN 24048 199
2 NaN 24883 289
3 1 20142 2164
4 1927 23679 3025
If you want your NaT to be changed to NaN you can use -
for i in date_list:
    data[i+"InDays"] = data[i].apply(lambda x: (current_date - x).days if x is not pd.NaT else np.NaN)
Example/Demo -
In [114]: for i in date_list:
   .....:     data[i+"InDays"] = data[i].apply(lambda x: (current_date - x).days if x is not pd.NaT else np.NaN)
   .....:
In [115]: data
Out[115]:
SubmittedDate PolicyStartDate PaidUpDate MaturityDate DraftDate \
0 NaT 2002-11-18 NaT 2041-03-04 NaT
1 NaT 2015-01-13 NaT NaT NaT
2 NaT 2014-10-15 NaT NaT NaT
3 NaT 2009-08-27 NaT NaT NaT
4 NaT 2007-04-19 NaT 2013-10-01 NaT
CurrentValuationDate DOB InForceDate SubmittedDateInDays \
0 2015-04-30 1976-03-04 2002-11-18 NaN
1 NaT 1949-09-27 2015-01-13 NaN
2 NaT 1947-06-15 2014-10-15 NaN
3 2015-07-30 1960-06-07 2009-08-27 NaN
4 2010-04-21 1950-10-01 2007-04-19 NaN
PolicyStartDateInDays PaidUpDateInDays MaturityDateInDays \
0 4638 NaN -9348
1 199 NaN NaN
2 289 NaN NaN
3 2164 NaN NaN
4 3025 NaN 668
DraftDateInDays CurrentValuationDateInDays DOBInDays InForceDateInDays
0 NaN 92 14393 4638
1 NaN NaN 24048 199
2 NaN NaN 24883 289
3 NaN 1 20142 2164
4 NaN 1927 23679 3025
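On current pandas, the per-row apply is unnecessary: subtracting a datetime64 column from a Timestamp is vectorized and yields timedelta64 even when every value is NaT, as long as the column genuinely has datetime64 dtype. A sketch with made-up data (two columns borrowed from the example, values abbreviated):

```python
import pandas as pd

current_date = pd.Timestamp("2015-07-31")
data = pd.DataFrame({
    "PolicyStartDate": pd.to_datetime(["2002-11-18", "2015-01-13"]),
    "DraftDate": pd.to_datetime([None, None]),  # all NaT, but datetime64 dtype
})

for col in ["PolicyStartDate", "DraftDate"]:
    # vectorized subtraction: NaT propagates, and .dt.days makes it numeric
    data[col + "InDays"] = (current_date - data[col]).dt.days
print(data)
```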
