pandas error creating TimeDeltas from Datetime operation - python

I have looked at several related questions, and none of them describes quite the same problem as mine.
I am using Pandas version 0.16.2. I have several columns in a Pandas dataframe, of dtype datetime64[ns]:
In [6]: date_list = ["SubmittedDate","PolicyStartDate", "PaidUpDate", "MaturityDate", "DraftDate", "CurrentValuationDate", "DOB", "InForceDate"]
In [11]: data[date_list].head()
Out[11]:
SubmittedDate PolicyStartDate PaidUpDate MaturityDate DraftDate \
0 NaT 2002-11-18 NaT 2041-03-04 NaT
1 NaT 2015-01-13 NaT NaT NaT
2 NaT 2014-10-15 NaT NaT NaT
3 NaT 2009-08-27 NaT NaT NaT
4 NaT 2007-04-19 NaT 2013-10-01 NaT
CurrentValuationDate DOB InForceDate
0 2015-04-30 1976-03-04 2002-11-18
1 NaT 1949-09-27 2015-01-13
2 NaT 1947-06-15 2014-10-15
3 2015-07-30 1960-06-07 2009-08-27
4 2010-04-21 1950-10-01 2007-04-19
These were originally in string format (e.g. '1976-03-04') which I converted to datetime objects using:
In [7]: for datecol in date_list:
...: data[datecol] = pd.to_datetime(data[datecol], coerce=True, errors = 'raise')
Here are the dtypes for each of these columns:
In [8]: for datecol in date_list:
print data[datecol].dtypes
returns:
datetime64[ns]
datetime64[ns]
datetime64[ns]
datetime64[ns]
datetime64[ns]
datetime64[ns]
datetime64[ns]
datetime64[ns]
So far, so good. But what I want to do is create a new column for each of these columns that gives the age in days (as an integer) from a certain date.
In [13]: current_date = pd.to_datetime("2015-07-31")
I first ran this:
In [14]: for i in date_list:
....: data[i+"InDays"] = data[i].apply(lambda x: current_date - x)
However, when I check the dtype of the returned columns:
In [15]: for datecol in date_list:
....: print data[datecol + "InDays"].dtypes
I get these:
object
timedelta64[ns]
object
timedelta64[ns]
object
timedelta64[ns]
timedelta64[ns]
timedelta64[ns]
I don't know why three of them are objects, when they should be timedeltas. What I want to do next is:
In [16]: for i in date_list:
....: data[i+"InDays"] = data[i+"InDays"].dt.days
This approach works fine for the timedelta columns. However, since three of the columns are not timedeltas, I get this error:
AttributeError: Can only use .dt accessor with datetimelike values
I suspect that there are some values in those three columns that are preventing Pandas from converting them to timedeltas. I can't figure out how to work out what those values might be.

The issue occurs because three of your columns contain only NaT values, so the result of apply on those columns is treated as object dtype instead of timedelta64[ns].
You can add a condition inside the apply call that falls back to a default timedelta when the value is NaT. Example -
for i in date_list:
    data[i+"InDays"] = data[i].apply(lambda x: current_date - x if x is not pd.NaT else pd.Timedelta(0))
Or, if you cannot do the above, guard the step data[i+"InDays"] = data[i+"InDays"].dt.days with a check, so that it runs only when the dtype of the series allows it.
Or a simpler way is to change the apply part so that it directly returns what you want -
for i in date_list:
    data[i+"InDays"] = data[i].apply(lambda x: (current_date - x).days if x is not pd.NaT else x)
This would output -
In [110]: data
Out[110]:
SubmittedDate PolicyStartDate PaidUpDate MaturityDate DraftDate \
0 NaT 2002-11-18 NaT 2041-03-04 NaT
1 NaT 2015-01-13 NaT NaT NaT
2 NaT 2014-10-15 NaT NaT NaT
3 NaT 2009-08-27 NaT NaT NaT
4 NaT 2007-04-19 NaT 2013-10-01 NaT
CurrentValuationDate DOB InForceDate SubmittedDateInDays \
0 2015-04-30 1976-03-04 2002-11-18 NaT
1 NaT 1949-09-27 2015-01-13 NaT
2 NaT 1947-06-15 2014-10-15 NaT
3 2015-07-30 1960-06-07 2009-08-27 NaT
4 2010-04-21 1950-10-01 2007-04-19 NaT
PolicyStartDateInDays PaidUpDateInDays MaturityDateInDays DraftDateInDays \
0 4638 NaT -9348 NaT
1 199 NaT NaN NaT
2 289 NaT NaN NaT
3 2164 NaT NaN NaT
4 3025 NaT 668 NaT
CurrentValuationDateInDays DOBInDays InForceDateInDays
0 92 14393 4638
1 NaN 24048 199
2 NaN 24883 289
3 1 20142 2164
4 1927 23679 3025
If you want your NaT to be changed to NaN you can use -
for i in date_list:
    data[i+"InDays"] = data[i].apply(lambda x: (current_date - x).days if x is not pd.NaT else np.nan)
Example/Demo -
In [114]: for i in date_list:
.....:     data[i+"InDays"] = data[i].apply(lambda x: (current_date - x).days if x is not pd.NaT else np.nan)
.....:
In [115]: data
Out[115]:
SubmittedDate PolicyStartDate PaidUpDate MaturityDate DraftDate \
0 NaT 2002-11-18 NaT 2041-03-04 NaT
1 NaT 2015-01-13 NaT NaT NaT
2 NaT 2014-10-15 NaT NaT NaT
3 NaT 2009-08-27 NaT NaT NaT
4 NaT 2007-04-19 NaT 2013-10-01 NaT
CurrentValuationDate DOB InForceDate SubmittedDateInDays \
0 2015-04-30 1976-03-04 2002-11-18 NaN
1 NaT 1949-09-27 2015-01-13 NaN
2 NaT 1947-06-15 2014-10-15 NaN
3 2015-07-30 1960-06-07 2009-08-27 NaN
4 2010-04-21 1950-10-01 2007-04-19 NaN
PolicyStartDateInDays PaidUpDateInDays MaturityDateInDays \
0 4638 NaN -9348
1 199 NaN NaN
2 289 NaN NaN
3 2164 NaN NaN
4 3025 NaN 668
DraftDateInDays CurrentValuationDateInDays DOBInDays InForceDateInDays
0 NaN 92 14393 4638
1 NaN NaN 24048 199
2 NaN NaN 24883 289
3 NaN 1 20142 2164
4 NaN 1927 23679 3025
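The same day counts can also be obtained without apply. The following is a minimal sketch, assuming a recent pandas version (not the 0.16.2 from the question) and a small made-up frame that borrows two of the question's column names. Subtracting a datetime column from a Timestamp yields a timedelta column even when every value is NaT, so .dt.days should work on all columns:

```python
import pandas as pd

# Hypothetical miniature of the question's data:
# one all-NaT column and one mostly populated column.
data = pd.DataFrame({
    "SubmittedDate": pd.to_datetime([None, None, None]),
    "PolicyStartDate": pd.to_datetime(["2002-11-18", "2015-01-13", None]),
})
date_list = ["SubmittedDate", "PolicyStartDate"]
current_date = pd.Timestamp("2015-07-31")

for col in date_list:
    # Vectorized subtraction keeps a timedelta dtype, even for all-NaT input.
    delta = current_date - data[col]
    # .dt.days gives whole days; missing values come out as NaN.
    data[col + "InDays"] = delta.dt.days

print(data)
```

Because no Python-level lambda is involved, pandas never has to infer a dtype from the returned objects, which is what seems to go wrong in the apply version.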

Related

How to calculate date difference from two columns but with different rows and a condition?

Based on the example dataframe below, I would like to calculate the difference between two datetimes for a given index, and the cumulative sum of that difference. The expected results are shown in the columns diff_days and cum_diff_days.
index  date_a    date_b     diff_days  cum_diff_days
1      1/1/2023  NaT        NaT        -
1      NaT       NaT        NaT        -
1      NaT       3/1/2023   2          2
2      4/1/2023  NaT        NaT        -
2      NaT       NaT        NaT        -
2      NaT       6/1/2023   2          4
3      7/1/2023  NaT        NaT        -
3      NaT       8/1/2023   1          5
3      9/1/2023  NaT        NaT        -
3      NaT       NaT        NaT        -
3      NaT       11/1/2023  2          7
I have checked the other post that calculates the difference between two dates, but unfortunately it covers the case where both dates are in the same row. In my case, I want to understand how to calculate the difference when the dates are in different rows and different columns, since simply subtracting with df['diff_days'] = df['date_a'] - df['date_b'] produces only NaT results. I would really appreciate it if someone could enlighten me on this problem.
Try this out
# python 3.10.6
from io import StringIO
import pandas as pd # 1.5.1
string = """index date_a date_b diff_days cum_diff_days
1 1/1/2023 NaT NaT -
1 NaT NaT NaT -
1 NaT 3/1/2023 2 2
2 4/1/2023 NaT NaT -
2 NaT NaT NaT -
2 NaT 6/1/2023 2 4
3 7/1/2023 NaT NaT -
3 NaT 8/1/2023 1 5
3 9/1/2023 NaT NaT -
3 NaT NaT NaT -
3 NaT 11/1/2023 2 7"""
df = pd.read_csv(StringIO(string), sep=r"\s+")  # split on any whitespace (tabs or spaces)
# convert to datetime
df["date_a"] = pd.to_datetime(df.date_a, format="%d/%m/%Y")
df["date_b"] = pd.to_datetime(df.date_b, format="%d/%m/%Y")
# forward-fill `df.date_a` and subtract from `df.date_b`
# then get `.days` attribute to convert to numeric
df["diff_days"] = df.date_b.sub(df.date_a.ffill()).dt.days
# cumulative sum the differences
df["cum_diff_days"] = df.diff_days.cumsum()
# optionally fill the nulls with "-"
df[["diff_days", "cum_diff_days"]] = df[
["diff_days", "cum_diff_days"]
].fillna("-")
print(df)
index date_a date_b diff_days cum_diff_days
0 1 2023-01-01 NaT - -
1 1 NaT NaT - -
2 1 NaT 2023-01-03 2.0 2.0
3 2 2023-01-04 NaT - -
4 2 NaT NaT - -
5 2 NaT 2023-01-06 2.0 4.0
6 3 2023-01-07 NaT - -
7 3 NaT 2023-01-08 1.0 5.0
8 3 2023-01-09 NaT - -
9 3 NaT NaT - -
10 3 NaT 2023-01-11 2.0 7.0
References:
pandas.to_datetime
pandas.Series.ffill
pandas.Series.cumsum
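A plain ffill carries date_a across the boundaries of the index column. If the intent is that a start date should only apply within its own index group (an assumption, since the question does not say so explicitly), a grouped forward fill is a small variation. A sketch on the question's data:

```python
import pandas as pd

df = pd.DataFrame({
    "index": [1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3],
    "date_a": ["1/1/2023", None, None, "4/1/2023", None, None,
               "7/1/2023", None, "9/1/2023", None, None],
    "date_b": [None, None, "3/1/2023", None, None, "6/1/2023",
               None, "8/1/2023", None, None, "11/1/2023"],
})
for col in ["date_a", "date_b"]:
    df[col] = pd.to_datetime(df[col], format="%d/%m/%Y")

# Forward-fill date_a only within each `index` group.
start = df.groupby("index")["date_a"].ffill()

# Day difference where date_b is present, NaN elsewhere.
df["diff_days"] = (df["date_b"] - start).dt.days

# cumsum skips NaN, so the running total only advances on rows with a difference.
df["cum_diff_days"] = df["diff_days"].cumsum()
print(df)
```

On this sample the result matches the ungrouped version, because every group starts with a date_a value.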
You can use to_datetime, where+bfill to form the grouper, then groupby.agg and join:
# ensure datetime
df[['date_a', 'date_b']] = df[['date_a', 'date_b']].apply(pd.to_datetime, dayfirst=True)
# form grouper based on backfilled date_b
# and use the index as group value
grp = df.index.to_series().where(df['date_b'].notna()).bfill()
# get the first date_a / last date_b (you can also get min/max, first/first…)
# compute the sum and cumsum
# join to original DataFrame
out = df.join(
df.groupby(grp).agg({'date_a': 'first', 'date_b': 'last'})
.assign(diff_days=lambda d: d['date_b'].sub(d['date_a']).dt.days,
cum_diff_days=lambda d: d['diff_days'].cumsum()
)[['diff_days', 'cum_diff_days']]
)
print(out)
Output:
index date_a date_b diff_days cum_diff_days
0.0 1 2023-01-01 NaT NaN NaN
1.0 1 NaT NaT NaN NaN
2.0 1 NaT 2023-01-03 2.0 2.0
3.0 2 2023-01-04 NaT NaN NaN
4.0 2 NaT NaT NaN NaN
5.0 2 NaT 2023-01-06 2.0 4.0
6.0 3 2023-01-07 NaT NaN NaN
7.0 3 NaT 2023-01-08 1.0 5.0
8.0 3 2023-01-09 NaT NaN NaN
9.0 3 NaT NaT NaN NaN
10.0 3 NaT 2023-01-10 1.0 6.0
Proposed script (for testing)
import pandas as pd
df = pd.DataFrame({'date_a': ["1/1/2023", pd.NaT, pd.NaT, "4/1/2023", pd.NaT, pd.NaT,
"7/1/2023", pd.NaT, "9/1/2023", pd.NaT, pd.NaT],
'date_b': [pd.NaT, pd.NaT, "3/1/2023", pd.NaT, pd.NaT, "6/1/2023",
pd.NaT, "8/1/2023", pd.NaT, pd.NaT, "11/1/2023"],
})
r = df.drop_duplicates(keep=False).copy()
r['date_a'] = r['date_a'].shift(1)
r = r.drop_duplicates(keep=False)
r['diff_days'] = (pd.to_datetime(r['date_b'], dayfirst=True)
- pd.to_datetime(r['date_a'], dayfirst=True)).dt.days
r['cum_diff_days'] = r['diff_days'].cumsum()
df = df.join(r[['diff_days', 'cum_diff_days']], how='left')
df['cum_diff_days'] = df['cum_diff_days'].fillna('-') # optional
print(df)
Result
date_a date_b diff_days cum_diff_days
0 1/1/2023 NaT NaN -
1 NaT NaT NaN -
2 NaT 3/1/2023 2.0 2.0
3 4/1/2023 NaT NaN -
4 NaT NaT NaN -
5 NaT 6/1/2023 2.0 4.0
6 7/1/2023 NaT NaN -
7 NaT 8/1/2023 1.0 5.0
8 9/1/2023 NaT NaN -
9 NaT NaT NaN -
10 NaT 11/1/2023 2.0 7.0
Note date_a and date_b keep their original type for further calculation

applying np.where to datetime64[ns] changes dtype to object

I have a dataset containing a lot of dates. I only want to keep the dates larger than the date stated in the first column. Otherwise, I would like to replace them with NaT. This is an example of what the original dataset looks like:
reference_date date_1 date_2 ...
0 2017-01-20 2016-02-09 NaT
1 2016-01-05 NaT NaT
2 2016-01-13 2015-07-22 2016-02-29
3 2016-01-13 2016-04-18 2015-05-11
4 2016-01-11 NaT NaT
... ... ... ...
This is the output I would like to have:
date_1 date_2 ...
0 NaT NaT
1 NaT NaT
2 NaT 2016-02-29
3 2016-04-18 NaT
4 NaT NaT
... ... ...
First I tried
df.loc[df['reference_date'] > df['date_1'], 'date_1'] = pd.NaT
which works for one column at the time but I want to apply this to a lot of columns.
I managed to replace all the unwanted dates with NaT using this code:
cols = df.columns[1:]
result = df[cols].apply(lambda x: np.where(x > df.reference_date, x, pd.NaT), axis = 0)
However, the original dates are transformed to another data type (originally it was datetime64[ns] and now it is object), resulting in large numbers instead of dates:
date_1 date_2 ...
0 NaT NaT
1 NaT NaT
2 NaT 1456704000000000000
3 1460937600000000000 NaT
4 NaT NaT
... ... ...
Any ideas what happens here and how I can keep the original date?
Many thanks
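One way to keep the datetime dtype is to stay inside pandas instead of going through np.where, which hands back a plain NumPy array rather than a datetime Series. A minimal sketch, assuming a recent pandas version and rebuilding the five sample rows shown above: Series.where keeps values where the condition holds and fills the rest with NaT.

```python
import pandas as pd

df = pd.DataFrame({
    "reference_date": pd.to_datetime(
        ["2017-01-20", "2016-01-05", "2016-01-13", "2016-01-13", "2016-01-11"]),
    "date_1": pd.to_datetime(["2016-02-09", None, "2015-07-22", "2016-04-18", None]),
    "date_2": pd.to_datetime([None, None, "2016-02-29", "2015-05-11", None]),
})

cols = df.columns[1:]

# For each date column, keep only values later than reference_date.
# Comparisons involving NaT are False, so missing dates stay NaT.
result = df[cols].apply(lambda s: s.where(s > df["reference_date"]))

print(result)
print(result.dtypes)
```

The columns of result should remain datetime columns, so the dates print as dates rather than as large integers.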

How to conditionally select the first non null date from multiple datetime columns in a pandas dataframe?

I have a pandas dataframe with multiple datetime columns. I want to create a new column that selects the first date that is not null from the first, second or third column, in that order. If there is no date in any of these 3 columns, it should be set to today.
An example of my database is:
date1 date2 date3
0 NaT 2019-01-26 NaT
1 2021-04-13 2021-02-27 NaT
2 NaT NaT NaT
3 NaT NaT NaT
4 NaT NaT NaT
I want to create a new column, date 4, with the first date that is not NaT from date 1 to date 3. The result I expect is:
date1 date2 date3 date4
0 NaT 2019-01-26 NaT 2019-01-26 # (date 2)
1 2021-04-13 2021-02-27 NaT 2021-04-13 # (date 1)
2 NaT NaT NaT 2021-06-04 # (today )
3 NaT NaT NaT 2021-06-04 # (today )
4 NaT NaT 2021-02-20 2021-02-20 # (date 3)
I tried this line:
df["date4"] = df.loc[(df["date1"]) | (df["date2"]) | (df["date3"]) | pd.to_datetime("today")]
but it raises the error TypeError: unsupported operand type(s) for |: 'DatetimeArray' and 'DatetimeArray'
The idea is to back fill missing values across the selected columns, then select the first column by position and replace the remaining missing values with today:
df['date4'] = (df[['date1','date2','date3']].bfill(axis=1)
.iloc[:, 0]
.fillna(pd.to_datetime("today").normalize()))
print (df)
date1 date2 date3 date4
0 NaT 2019-01-26 NaT 2019-01-26
1 2021-04-13 2021-02-27 NaT 2021-04-13
2 NaT NaT NaT 2021-06-04
3 NaT NaT NaT 2021-06-04
4 NaT NaT NaT 2021-06-04

Pandas: Subtracting two date columns and the result being an integer

I have two columns in a Pandas data frame that are dates.
I am looking to subtract one column from another and the result being the difference in numbers of days as an integer.
A peek at the data:
df_test.head(10)
Out[20]:
First_Date Second Date
0 2016-02-09 2015-11-19
1 2016-01-06 2015-11-30
2 NaT 2015-12-04
3 2016-01-06 2015-12-08
4 NaT 2015-12-09
5 2016-01-07 2015-12-11
6 NaT 2015-12-12
7 NaT 2015-12-14
8 2016-01-06 2015-12-14
9 NaT 2015-12-15
I have created a new column successfully with the difference:
df_test['Difference'] = df_test['First_Date'].sub(df_test['Second Date'], axis=0)
df_test.head()
Out[22]:
First_Date Second Date Difference
0 2016-02-09 2015-11-19 82 days
1 2016-01-06 2015-11-30 37 days
2 NaT 2015-12-04 NaT
3 2016-01-06 2015-12-08 29 days
4 NaT 2015-12-09 NaT
However I am unable to get a numeric version of the result:
df_test['Difference'] = df_test[['Difference']].apply(pd.to_numeric)
df_test.head()
Out[25]:
First_Date Second Date Difference
0 2016-02-09 2015-11-19 7.084800e+15
1 2016-01-06 2015-11-30 3.196800e+15
2 NaT 2015-12-04 NaN
3 2016-01-06 2015-12-08 2.505600e+15
4 NaT 2015-12-09 NaN
How about:
df_test['Difference'] = (df_test['First_Date'] - df_test['Second Date']).dt.days
This returns the difference as int if there are no missing values (NaT), and as float if there are.
Pandas has rich documentation on Time series / date functionality and Time deltas.
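If integers are needed even though some rows are NaT, the nullable Int64 dtype is one option. This is a sketch under the assumption of a recent pandas version, using three rows from the question's data:

```python
import pandas as pd

df_test = pd.DataFrame({
    "First_Date": pd.to_datetime(["2016-02-09", "2016-01-06", None]),
    "Second Date": pd.to_datetime(["2015-11-19", "2015-11-30", "2015-12-04"]),
})

# .dt.days is float here because the NaT row becomes NaN.
days = (df_test["First_Date"] - df_test["Second Date"]).dt.days

# Nullable integer dtype: whole numbers stay integers, NaN becomes <NA>.
df_test["Difference"] = days.astype("Int64")
print(df_test)
```

Note the capital "I" in "Int64"; the lowercase NumPy int64 cannot hold missing values.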
You can divide a column of dtype timedelta by np.timedelta64(1, 'D'), but the output is float rather than int because of the NaN values:
df_test['Difference'] = df_test['Difference'] / np.timedelta64(1, 'D')
print (df_test)
First_Date Second Date Difference
0 2016-02-09 2015-11-19 82.0
1 2016-01-06 2015-11-30 37.0
2 NaT 2015-12-04 NaN
3 2016-01-06 2015-12-08 29.0
4 NaT 2015-12-09 NaN
5 2016-01-07 2015-12-11 27.0
6 NaT 2015-12-12 NaN
7 NaT 2015-12-14 NaN
8 2016-01-06 2015-12-14 23.0
9 NaT 2015-12-15 NaN
Frequency conversion.
You can use datetime module to help here. Also, as a side note, a simple date subtraction should work as below:
import datetime as dt
import numpy as np
import pandas as pd
#Assume we have df_test:
In [222]: df_test
Out[222]:
first_date second_date
0 2016-01-31 2015-11-19
1 2016-02-29 2015-11-20
2 2016-03-31 2015-11-21
3 2016-04-30 2015-11-22
4 2016-05-31 2015-11-23
5 2016-06-30 2015-11-24
6 NaT 2015-11-25
7 NaT 2015-11-26
8 2016-01-31 2015-11-27
9 NaT 2015-11-28
10 NaT 2015-11-29
11 NaT 2015-11-30
12 2016-04-30 2015-12-01
13 NaT 2015-12-02
14 NaT 2015-12-03
15 2016-04-30 2015-12-04
16 NaT 2015-12-05
17 NaT 2015-12-06
In [223]: df_test['Difference'] = df_test['first_date'] - df_test['second_date']
In [224]: df_test
Out[224]:
first_date second_date Difference
0 2016-01-31 2015-11-19 73 days
1 2016-02-29 2015-11-20 101 days
2 2016-03-31 2015-11-21 131 days
3 2016-04-30 2015-11-22 160 days
4 2016-05-31 2015-11-23 190 days
5 2016-06-30 2015-11-24 219 days
6 NaT 2015-11-25 NaT
7 NaT 2015-11-26 NaT
8 2016-01-31 2015-11-27 65 days
9 NaT 2015-11-28 NaT
10 NaT 2015-11-29 NaT
11 NaT 2015-11-30 NaT
12 2016-04-30 2015-12-01 151 days
13 NaT 2015-12-02 NaT
14 NaT 2015-12-03 NaT
15 2016-04-30 2015-12-04 148 days
16 NaT 2015-12-05 NaT
17 NaT 2015-12-06 NaT
Now, change the type to datetime.timedelta, and then read the .days attribute on the valid timedelta objects.
In [226]: df_test['Diffference'] = df_test['Difference'].astype(dt.timedelta).map(lambda x: np.nan if pd.isnull(x) else x.days)
In [227]: df_test
Out[227]:
first_date second_date Difference Diffference
0 2016-01-31 2015-11-19 73 days 73
1 2016-02-29 2015-11-20 101 days 101
2 2016-03-31 2015-11-21 131 days 131
3 2016-04-30 2015-11-22 160 days 160
4 2016-05-31 2015-11-23 190 days 190
5 2016-06-30 2015-11-24 219 days 219
6 NaT 2015-11-25 NaT NaN
7 NaT 2015-11-26 NaT NaN
8 2016-01-31 2015-11-27 65 days 65
9 NaT 2015-11-28 NaT NaN
10 NaT 2015-11-29 NaT NaN
11 NaT 2015-11-30 NaT NaN
12 2016-04-30 2015-12-01 151 days 151
13 NaT 2015-12-02 NaT NaN
14 NaT 2015-12-03 NaT NaN
15 2016-04-30 2015-12-04 148 days 148
16 NaT 2015-12-05 NaT NaN
17 NaT 2015-12-06 NaT NaN
Hope that helps.
I feel that the answers above do not handle dates that 'wrap' around a year. Handling this is useful when you want the proximity of two dates by day of year. To do these row operations, I did the following. (I used this in a business setting for renewing customer subscriptions.)
def get_date_difference(row, x, y):
    try:
        # Calculating the smallest date difference between the start and the close date
        # There's some tricky logic in here to calculate for determining date difference
        # the other way around (Dec -> Jan is 1 month rather than 11)
        sub_start_date = int(row[x].strftime('%j')) # day of year (1-366)
        close_date = int(row[y].strftime('%j')) # day of year (1-366)
        later_date_of_year = max(sub_start_date, close_date)
        earlier_date_of_year = min(sub_start_date, close_date)
        days_diff = later_date_of_year - earlier_date_of_year
        # Calculates the difference going across the next year (December -> Jan)
        days_diff_reversed = (365 - later_date_of_year) + earlier_date_of_year
        return min(days_diff, days_diff_reversed)
    except ValueError:
        return None
The function can then be applied row by row:
dfAC_Renew['date_difference'] = dfAC_Renew.apply(get_date_difference, x = 'customer_since_date', y = 'renewal_date', axis = 1)
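The same day-of-year logic can be written without a row-wise apply. A minimal sketch, assuming a recent pandas version and a made-up three-row frame that reuses the column names above; like the function, it treats every year as 365 days, and it returns NaN rather than None for missing dates:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with the same column names as above.
df = pd.DataFrame({
    "customer_since_date": pd.to_datetime(["2021-12-15", "2021-03-01", None]),
    "renewal_date": pd.to_datetime(["2022-01-05", "2021-03-11", "2021-06-01"]),
})

# Day of year (1-366) for each date; NaT becomes NaN.
doy_start = df["customer_since_date"].dt.dayofyear
doy_close = df["renewal_date"].dt.dayofyear

# Distance within the same year, and distance going across the year boundary.
direct = (doy_close - doy_start).abs()
wrapped = 365 - direct

# Element-wise minimum of the two distances.
df["date_difference"] = np.minimum(direct, wrapped)
print(df)
```

Here 365 - direct is algebraically the same as (365 - later) + earlier in the function above.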
Create a vectorized method
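import numpy as np
from pandas.tseries.frequencies import to_offset

def calc_xb_minus_xa(df):
    # map the offset name of the first row's time delta to a numpy unit code
    time_dict = {
        '<Minute>': 'm',
        '<Hour>': 'h',
        '<Day>': 'D',
        '<Week>': 'W',
        '<Month>': 'M',
        '<Year>': 'Y'
    }
    time_delta = df.at[df.index[0], 'end_time'] - df.at[df.index[0], 'open_time']
    offset_base_name = str(to_offset(time_delta).base)
    time_term = time_dict.get(offset_base_name)
    # express every difference in the unit detected above
    result = (df.end_time - df.open_time) / np.timedelta64(1, time_term)
    return result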
Then in your df do:
df['x'] = calc_xb_minus_xa(df)
This is meant to cover minutes, hours, days, weeks, months and years.
Replace open_time and end_time with the column names in your df.

How to make all non-date values null in Pandas

I have an excel doc where the users put dates and strings in the same column. I want to make every string object null and leave all the dates. How do I do this in pandas? Thanks.
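An easy way to convert dates in a DataFrame is with pandas.DataFrame.convert_objects, as mentioned by @Jeff; it also handles numbers and timedeltas. Note that convert_objects was deprecated in pandas 0.17 and removed in 1.0, so this only runs on older versions. Here is an example of using it: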
# contents of Sheet1 of test.xlsx
x y date1 z date2 date3
1 fum 6/1/2016 7 9/1/2015 string3
2 fo 6/2/2016 alpha string0 10/1/2016
3 fi 6/3/2016 9 9/3/2015 10/2/2016
4 fee 6/4/2016 10 string1 string4
5 dumbledum 6/5/2016 beta string2 10/3/2015
6 dumbledee 6/6/2016 12 9/4/2015 string5
import pandas as pd
xl = pd.ExcelFile('test.xlsx')
df = xl.parse("Sheet1")
df1 = df.convert_objects(convert_dates='coerce')
# 'coerce' required for conversion to NaT on error
df1
Out[7]:
x y date1 z date2 date3
0 1 fum 2016-06-01 7 2015-09-01 NaT
1 2 fo 2016-06-02 alpha NaT 2016-10-01
2 3 fi 2016-06-03 9 2015-09-03 2016-10-02
3 4 fee 2016-06-04 10 NaT NaT
4 5 dumbledum 2016-06-05 beta NaT 2015-10-03
5 6 dumbledee 2016-06-06 12 2015-09-04 NaT
Individual columns in a DataFrame can be converted using pandas.to_datetime, as pointed out by @Jeff, or with pandas.Series.map; neither works in place, so the result must be assigned back. For example, with pandas.to_datetime:
import pandas as pd
xl2 = pd.ExcelFile('test.xlsx')
df2 = xl2.parse("Sheet1")
for col in ['date1', 'date2', 'date3']:
    df2[col] = pd.to_datetime(df2[col],coerce=True, infer_datetime_format=True)
df2
Out[8]:
x y date1 z date2 date3
0 1 fum 2016-06-01 7 2015-09-01 NaT
1 2 fo 2016-06-02 alpha NaT 2016-10-01
2 3 fi 2016-06-03 9 2015-09-03 2016-10-02
3 4 fee 2016-06-04 10 NaT NaT
4 5 dumbledum 2016-06-05 beta NaT 2015-10-03
5 6 dumbledee 2016-06-06 12 2015-09-04 NaT
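On current pandas versions the coerce=True keyword no longer exists; errors='coerce' is its replacement. The following is a minimal sketch, assuming a recent pandas version and using an in-memory stand-in for two columns of Sheet1, with the dates as strings (when read from Excel they may already be datetime objects):

```python
import pandas as pd

# Hypothetical in-memory stand-in for the date2/date3 columns of Sheet1.
df = pd.DataFrame({
    "date2": ["9/1/2015", "string0", "9/3/2015", "string1", "string2", "9/4/2015"],
    "date3": ["string3", "10/1/2016", "10/2/2016", "string4", "10/3/2015", "string5"],
})

for col in ["date2", "date3"]:
    # errors="coerce" turns anything that does not parse as a date into NaT.
    df[col] = pd.to_datetime(df[col], format="%m/%d/%Y", errors="coerce")

print(df)
```

The month-first format string is chosen to match the output shown above (6/1/2016 becomes 2016-06-01); adjust it if your sheet uses day-first dates.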
And using pandas.Series.map:
import pandas as pd
import datetime
xl3 = pd.ExcelFile('test.xlsx')
df3 = xl3.parse("Sheet1")
for col in ['date1', 'date2', 'date3']:
    df3[col] = df3[col].map(lambda x: x if isinstance(x,(datetime.datetime)) else None)
df3
Out[9]:
x y date1 z date2 date3
0 1 fum 2016-06-01 7 2015-09-01 NaT
1 2 fo 2016-06-02 alpha NaT 2016-10-01
2 3 fi 2016-06-03 9 2015-09-03 2016-10-02
3 4 fee 2016-06-04 10 NaT NaT
4 5 dumbledum 2016-06-05 beta NaT 2015-10-03
5 6 dumbledee 2016-06-06 12 2015-09-04 NaT
An upfront way to convert dates in an Excel document is to do it while parsing its sheets. Pass pandas.ExcelFile.parse's converters option a dict whose values are a function built on pandas.to_datetime with coerce=True, so that values that fail to parse become NaT. For example:
def converter(x):
    return pd.to_datetime(x,coerce=True,infer_datetime_format=True)
    # the following also works for this example
    # return pd.to_datetime(x,format='%d/%m/%Y',coerce=True)
converters={'date1': converter,'date2': converter, 'date3': converter}
xl4 = pd.ExcelFile('test.xlsx')
df4 = xl4.parse("Sheet1",converters=converters)
df4
Out[10]:
x y date1 z date2 date3
0 1 fum 2016-06-01 7 2015-09-01 NaT
1 2 fo 2016-06-02 alpha NaT 2016-10-01
2 3 fi 2016-06-03 9 2015-09-03 2016-10-02
3 4 fee 2016-06-04 10 NaT NaT
4 5 dumbledum 2016-06-05 beta NaT 2015-10-03
5 6 dumbledee 2016-06-06 12 2015-09-04 NaT
