numpy datetime and pandas datetime - python

I'm confused by the interoperation between numpy and pandas date objects (or maybe just by numpy's datetime64 in general).
I was trying to count business days using numpy's built-in functionality like so:
np.busday_count("2016-03-01", "2016-03-31", holidays=[np.datetime64("28/03/2016")])
However, numpy apparently can't deal with the day-first date format:
ValueError: Error parsing datetime string "28/03/2016" at position 2
To get around this, I thought I'd just use pandas to_datetime, which can handle it. However:
np.busday_count("2016-03-01", "2016-03-31", holidays=[np.datetime64(pd.to_datetime("28/03/2016"))])
ValueError: Cannot safely convert provided holidays input into an array of dates
Searching around for a bit, it seems this is caused by the chaining of to_datetime and np.datetime64, which results in a datetime64[us] object that the busday_count function cannot accept (is this intended behaviour or a bug?). Thus, my next attempt was:
np.busday_count("2016-03-01", "2016-03-31", holidays=[np.datetime64(pd.Timestamp("28"), "D")])
But:
TypeError: Cannot cast datetime.datetime object from metadata [us] to [D] according to the rule 'same_kind'
And that's me out - why are there so many incompatibilities between all these datetime formats? And how can I get around them?

I've been having a similar issue using np.is_busday().
Getting the datetime64 unit right is vital. Checking the numpy datetime docs, you can specify the numpy datetime type to be D (day).
This works (here holidays is a DataFrame whose Date column holds date strings, and days is the array of dates being checked):
my_holidays = np.array([datetime.datetime.strptime(x, '%m/%d/%y') for x in holidays.Date.values], dtype='datetime64[D]')
day_flags['business_day'] = np.is_busday(days, holidays=my_holidays)
Whereas this throws the same error you got:
my_holidays = np.array([datetime.datetime.strptime(x, '%m/%d/%y') for x in holidays.Date.values], dtype='datetime64')
The only difference is specifying the type of datetime64.
dtype='datetime64[D]'
vs
dtype='datetime64'
Docs are here:
https://docs.scipy.org/doc/numpy-1.13.0/reference/arrays.datetime.html
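Applying that to the original question, a minimal sketch (day-first parsing of the 28/03/2016 string is assumed):
import numpy as np
import pandas as pd

# Parse the day-first string with pandas, then keep only the date part,
# which casts safely to datetime64[D]
holiday = pd.to_datetime("28/03/2016", dayfirst=True).date()
print(np.busday_count("2016-03-01", "2016-03-31", holidays=[holiday]))  # 21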

I had the same issue while using np.busday_count; I later figured out the problem was the time-of-day components (hours, minutes, seconds, and so on) getting added when converting to a datetime object or numpy datetime object.
I simply converted to a date-only object, dropping the time-of-day components.
The following was my code:
holidays_list.json file:
{
    "holidays_2019": [
        "04-Mar-2019",
        "21-Mar-2019",
        "17-Apr-2019",
        "19-Apr-2019",
        "29-Apr-2019",
        "01-May-2019",
        "05-Jun-2019",
        "12-Aug-2019",
        "15-Aug-2019",
        "02-Sep-2019",
        "10-Sep-2019",
        "02-Oct-2019",
        "08-Oct-2019",
        "28-Oct-2019",
        "12-Nov-2019",
        "25-Dec-2019"
    ],
    "format": "%d-%b-%Y"
}
code file:
import json
import datetime
import numpy as np

with open('holidays_list.json', 'r') as infile:
    data = json.loads(infile.read())

# the following is where I convert each parsed datetime to a date-only object
holidays = list(map(lambda x: datetime.datetime.strptime(
    x, data['format']).date(), data['holidays_2019']))

start_date = datetime.datetime.today().date()
end_date = start_date + datetime.timedelta(days=30)
print(np.busday_count(start_date, end_date, holidays=holidays))

Related

Pandas to_datetime function resulting in Unix timestamp instead of Datetime for certain date strings

I am running into an issue where the pandas to_datetime function results in a Unix timestamp instead of a datetime object for certain rows. The date format in rows that convert to datetime and rows that convert to a Unix timestamp (as int) appears to be identical. When the problem occurs, it seems to affect all the dates in the row.
For example:
2019-01-02T10:12:28.64Z (stored as str) ends up as 1546424003423000000
While
2019-09-17T11:28:49.35Z (stored as str) converts to a datetime object.
Another date in the same row is 2019-01-02T10:13:23.423Z (stored as str) which is converting to a timestamp as well.
There isn't much code to look at, the conversion happens on a single line:
full_df.loc[mask, 'transaction_changed_datetime'] = pd.to_datetime(full_df['SaleItemChangedOn'])
and
full_df.loc[pd.isnull(full_df['completed_date']), 'completed_date'] = pd.to_datetime(full_df['SaleCompletedOn'])
I've tried with errors='coerce' as well, but the result is the same. I can deal with this problem later in the code, but I would really like to understand why this is happening.
Edit
As requested, this is the MRE that reproduces the issue on my computer. Some notes on this:
The mask is somehow involved. If I remove the mask it converts fine.
If I only pass in the first row in the Dataframe (single row Dataframe) it converts fine.
import pandas as pd
from pandas import NaT, Timestamp

debug_dict = {'SaleItemChangedOn': ['2019-01-02T10:12:28.64Z', '2019-01-02T10:12:28.627Z'],
              'transaction_changed_datetime': [NaT, Timestamp('2019-01-02 11:58:47.900000+0000', tz='UTC')]}
df = pd.DataFrame(debug_dict)
mask = (pd.isnull(df['transaction_changed_datetime']))
df.loc[mask, 'transaction_changed_datetime'] = pd.to_datetime(df['SaleItemChangedOn'])
When I try the examples you mention:
import numpy as np
import pandas as pd
df = pd.DataFrame({'a':['2019-01-02T10:12:28.64Z', '2019-09-17T11:28:49.35Z', np.nan]})
pd.to_datetime(df['a'])
There doesn't seem to be any issue:
Out[74]:
0 2019-01-02 10:12:28.640000+00:00
1 2019-09-17 11:28:49.350000+00:00
2 NaT
Name: a, dtype: datetime64[ns, UTC]
Could you provide an MRE?
You might want to check if you have more than one column with the same name which is being sent to pd.to_datetime. It solved the datetime being converted to timestamp problem for me.
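A quick way to check for that case (a sketch, assuming the frame is full_df as above):
dupes = full_df.columns[full_df.columns.duplicated()]
print(dupes)  # any repeated column labels show up here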
This appears to have been a bug in pandas that was fixed with the release of v1.0. The example code above now produces the expected results.
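For anyone still on a pre-1.0 pandas, one workaround to try (a sketch based on the MRE above, not from the original thread) is to convert the whole column first and assign only the masked slice:
import pandas as pd
from pandas import NaT, Timestamp

df = pd.DataFrame({'SaleItemChangedOn': ['2019-01-02T10:12:28.64Z', '2019-01-02T10:12:28.627Z'],
                   'transaction_changed_datetime': [NaT, Timestamp('2019-01-02 11:58:47.900000+0000', tz='UTC')]})
mask = df['transaction_changed_datetime'].isnull()
converted = pd.to_datetime(df['SaleItemChangedOn'])  # convert the full column up front
df.loc[mask, 'transaction_changed_datetime'] = converted[mask]  # assign only the masked rows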

Get current date from datetime.datetime

I'm trying to get the current date so I can pass it to a DATE field in SQL. I'm using datetime.datetime; below is my code:
from datetime import datetime
dt = datetime.strptime(datetime.today().date(), "%m/%d/%Y").date()
However, I'm getting this error:
TypeError: strptime() argument 1 must be str, not datetime.datetime
How can I fix this? I'm still confused about datetime and datetime.datetime, and I want to keep using from datetime import datetime rather than import datetime. Thank you!
If you look closely at the results of the following statements,
>>> datetime.today().date()
datetime.date(2019, 9, 30)
>>> str(datetime.today().date())
'2019-09-30'
you'll notice that the date returned is hyphen-separated and has to be converted explicitly to a string value. Hence, for the above statement to work, change it to:
dt = datetime.strptime(str(datetime.today().date()), "%Y-%m-%d").date()
Then change it to whatever format you desire using strftime (in your case "%d/%m/%Y"):
>>> dt.strftime("%d/%m/%Y")
'30/09/2019'
Just use datetime.strftime:
from datetime import datetime
dt = datetime.today().strftime("%m/%d/%Y")
print(dt)
Prints:
09/30/2019
strptime takes a string and makes a datetime object. Whereas strftime does exactly the opposite, taking a datetime object and returning a string.
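A minimal round-trip sketch illustrating the two:
from datetime import datetime

dt = datetime.strptime("09/30/2019", "%m/%d/%Y")  # str -> datetime
s = dt.strftime("%m/%d/%Y")                       # datetime -> str
print(repr(dt))  # datetime.datetime(2019, 9, 30, 0, 0)
print(s)         # 09/30/2019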

How to play around with JSON date format?

I have a JSON date data set and am trying to calculate the time difference between two JSON DateTime values.
For example :
'2015-01-28T21:41:38.508275' - '2015-01-28T21:41:34.921589'
Please look at the python code below:
# let's say 'time' is my data frame and the JSON-formatted time values are under the 'due_date' column
time_spent = time.iloc[2]['due_date'] - time.iloc[10]['due_date']
This doesn't work. I also tried casting each operand to int, but that didn't help either. What are the different ways to perform this calculation?
I use the parser from dateutil.
Something like this:
from dateutil.parser import parse
first_date_obj = parse("2015-01-28T21:41:38.508275")
second_date_obj = parse("2015-02-28T21:41:38.508275")
print(second_date_obj - first_date_obj)
You can also access the year, month, day of the date object like that:
print(first_date_obj.year)
print(first_date_obj.month)
print(first_date_obj.day)
# and so on
from datetime import datetime
date_format = '%Y-%m-%dT%H:%M:%S.%f'
d2 = time.iloc[2]['due_date']
d1 = time.iloc[10]['due_date']
time_spent = datetime.strptime(d2, date_format) - datetime.strptime(d1, date_format)
print(time_spent.days) # 0
print(time_spent.microseconds) # 586686
print(time_spent.seconds) # 3
print(time_spent.total_seconds()) # 3.586686
The easiest thing to do is to use the pandas datetime capability (since you are already using iloc, I assume you are using pandas). You can convert the entire dataframe column labeled due_date to the pandas datetime datatype using
import pandas as pd
time['due_date'] = pd.to_datetime(time['due_date'])
then calculate the time difference you want using
time_spent = time.iloc[2]['due_date'] - time.iloc[10]['due_date']
time_spent will be a pandas timedelta object that you can then manipulate as necessary.
See https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html and https://pandas.pydata.org/pandas-docs/stable/user_guide/timedeltas.html.
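A self-contained sketch using the two timestamps from the question (the frame and column names are assumed):
import pandas as pd

time = pd.DataFrame({'due_date': ['2015-01-28T21:41:38.508275', '2015-01-28T21:41:34.921589']})
time['due_date'] = pd.to_datetime(time['due_date'])
time_spent = time.iloc[0]['due_date'] - time.iloc[1]['due_date']
print(time_spent)                  # 0 days 00:00:03.586686
print(time_spent.total_seconds()) # 3.586686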

Python: convert 'days since 1990' to datetime object

I have a time series that I have pulled from a netCDF file and I'm trying to convert the values to a datetime format. The format of the time series is 'days since 1990-01-01 00:00:00 +10' (+10 being GMT+10).
time = nc_data.variables['time'][:]
time_idx = 0 # first timestamp
print time[time_idx]
9465.0
My desired output is a datetime object like so (also GMT +10):
"2015-12-01 00:00:00"
I have tried converting this using the time module without much success, although I believe I may be using it wrong (I'm still a novice in Python and programming).
import time
time_datetime = time.strftime('%Y-%m-%d %H:%M:%S', time.gmtime(time[time_idx]*24*60*60))
Any advice appreciated,
Cheers!
The datetime module's timedelta is probably what you're looking for.
For example:
from datetime import date, timedelta
days = 9465 # This may work for floats in general, but using integers
# is more precise (e.g. days = int(9465.0))
start = date(1990,1,1) # This is the "days since" part
delta = timedelta(days) # Create a time delta object from the number of days
offset = start + delta # Add the specified number of days to 1990
print(offset) # >>> 2015-12-01
print(type(offset)) # >>> <class 'datetime.date'>
You can then use and/or manipulate the offset object, or convert it to a string representation however you see fit.
You can format this date object the same way as your time_datetime:
print(offset.strftime('%Y-%m-%d %H:%M:%S'))
Output:
2015-12-01 00:00:00
Instead of using a date object, you could use a datetime object instead if, for example, you were later going to add hours/minutes/seconds/timezone offsets to it.
The code would stay the same as above with the exception of two lines:
# Here, you're importing datetime instead of date
from datetime import datetime, timedelta
# Here, you're creating a datetime object instead of a date object
start = datetime(1990,1,1) # This is the "days since" part
Note: Although you don't state it, the other answer suggests you might be looking for timezone-aware datetimes. If that's the case, dateutil is the way to go in Python 2, as the other answer suggests. In Python 3, you'd want to use the datetime module's tzinfo.
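For example, a minimal Python 3 sketch of the timezone-aware variant, using only the standard library:
from datetime import datetime, timedelta, timezone

tz10 = timezone(timedelta(hours=10))       # the GMT+10 offset from the units string
start = datetime(1990, 1, 1, tzinfo=tz10)
print(start + timedelta(days=9465))        # 2015-12-01 00:00:00+10:00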
netCDF num2date is the correct function to use here:
import netCDF4
ncfile = netCDF4.Dataset('./foo.nc', 'r')
time = ncfile.variables['time'] # do not cast to numpy array yet
time_convert = netCDF4.num2date(time[:], time.units, time.calendar)
This will convert the number of days since 1990-01-01 (i.e. the units of time) to python datetime objects. If time does not have a calendar attribute, you'll need to specify the calendar, or use the default of 'standard'.
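A sketch of that fallback, continuing the snippet above, in case the variable lacks the attribute:
calendar = getattr(time, 'calendar', 'standard')  # fall back to the default calendar
time_convert = netCDF4.num2date(time[:], time.units, calendar)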
We can do this in a couple of steps. First, we are going to use the dateutil library to handle our work; it will make some of this easier.
The first step is to get a datetime object from your string (1990-01-01 00:00:00 +10). We'll do that with the following code:
from datetime import datetime
from dateutil.relativedelta import relativedelta
import dateutil.parser
days_since = '1990-01-01 00:00:00 +10'
days_since_dt = dateutil.parser.parse(days_since)
Now, our days_since_dt will look like this:
datetime.datetime(1990, 1, 1, 0, 0, tzinfo=tzoffset(None, 36000))
We'll use that in our next step of determining the new date. We'll use relativedelta in dateutil to handle this math.
new_date = days_since_dt + relativedelta(days=9465.0)
This will result in your value in new_date having a value of:
datetime.datetime(2015, 12, 1, 0, 0, tzinfo=tzoffset(None, 36000))
This method ensures that the answer you receive continues to be in GMT+10.

Why does a pandas Series return the elements of my numpy datetime64 array as Timestamps?

I have a pandas Series which can be constructed like the following:
import numpy as np
import pandas as pd
import psycopg2.tz
from datetime import datetime

given_time = datetime(2013, 10, 8, 0, 0, 33, 945109,
                      tzinfo=psycopg2.tz.FixedOffsetTimezone(offset=60, name=None))
given_times = np.array([given_time] * 3, dtype='datetime64[ns]')
column = pd.Series(given_times)
The dtype of my Series is datetime64[ns]
However, when I access it: column[1], somehow it becomes of type pandas.tslib.Timestamp, while column.values[1] stays np.datetime64. Does Pandas auto cast my datetime into Timestamp when accessing the item? Is it slow?
Do I need to worry about the difference in types? As far as I can see, Timestamp seems not to have a timezone (numpy.datetime64('2013-10-08T00:00:33.945109000+0100') -> Timestamp('2013-10-07 23:00:33.945109', tz=None)).
In practice, I would do datetime arithmetic such as taking differences and comparing to a timedelta. Does the possible type inconsistency around my operators affect my use case at all?
Besides, am I encouraged to use pd.to_datetime instead of astype(dtype='datetime64') when converting datetime objects?
Pandas time types are built on top of numpy's datetime64.
In order to continue using the pandas operators, you should keep using pd.to_datetime, rather than astype(dtype='datetime64'). This is especially true since you'll be taking datetime deltas, which pandas handles admirably, for example with resampling and period definitions.
http://pandas.pydata.org/pandas-docs/stable/timeseries.html#up-and-downsampling
http://pandas.pydata.org/pandas-docs/stable/timeseries.html#period
Though I haven't measured, since the pandas times are wrapping numpy times, I suspect the conversion is quite fast. Alternatively, you can just use pandas' built-in time series definitions and avoid the conversion altogether.
As a rule of thumb, though, it's good to use the type from the package whose functions you'll be using. So if you're really only going to use numpy to manipulate the arrays, stick with numpy datetimes; for pandas methods, pandas datetimes.
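A small sketch of the recommended conversion (example dates assumed):
import pandas as pd

raw = pd.Series(['2013-10-08 00:00:33', '2013-10-09 00:00:33'])
converted = pd.to_datetime(raw)     # preferred over raw.astype('datetime64[ns]')
print(converted.dtype)              # datetime64[ns]
print(converted[1] - converted[0])  # 1 days 00:00:00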
I had read in the documentation somewhere (apologies, I can't find the link) that scalar values will be converted to Timestamps while arrays will keep their data type. For example:
from datetime import date
import pandas as pd
time_series = pd.Series([date(2010 + x, 1, 1) for x in range(5)])
time_series = time_series.apply(pd.to_datetime)
so that:
In[1]:time_series
Out[1]:
0 2010-01-01
1 2011-01-01
2 2012-01-01
3 2013-01-01
4 2014-01-01
dtype: datetime64[ns]
and yet:
In[2]:time_series.iloc[0]
Out[2]:Timestamp('2010-01-01 00:00:00')
while:
In[3]:time_series.values[0]
Out[3]:numpy.datetime64('2009-12-31T19:00:00.000000000-0500')
because iloc requests a scalar from pandas (type conversion to Timestamp) while values requests the full numpy array (no type conversion).
There is similar behavior for series of length one. Additionally, referencing more than one element in a slice (i.e. iloc[1:10]) will return a Series, which will always keep its datatype.
I'm unsure as to why pandas behaves this way.
In[4]: pd.__version__
Out[4]: '0.15.2'
