I am trying to implement a time series prediction model in Python but am running into issues with datetime data.
So I have a dataframe 'df' with two columns of datetime and float types:
Then I try to build an array using the values method, but something strange happens: the dates are displayed in an odd format, as timestamps with a time component:
And basically because of this, I cannot fit the model; I receive messages such as: "Cannot add integral value to Timestamp without freq."
So what seems to be the problem and how can it be solved?
It's complicated.
First of all, when creating a numpy array, all elements must share a single dtype, and datetime64 is not the same as int. So we'll have to resolve that, and we will.
Second, you tried to do this with df.values, which makes sense. However, what happens is that pandas upcasts the whole df to dtype=object and hands back an object array. The problem with that is that Timestamps get left as Timestamps, which is what is getting in your way.
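For instance, rebuilding a small frame shaped like yours (column names taken from the code below; the values are assumptions), you can see the object upcast:

import pandas as pd

df = pd.DataFrame({'transaction_date': pd.date_range('2016-02-01', periods=5),
                   'amount': [1, 2, 3, 4, 5]})
df.values.dtype         # dtype('O') -- the whole array is upcast to object
type(df.values[0, 0])   # pandas Timestamp, not numpy datetime64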
So I'd convert them on my own like this
import numpy as np
a = np.column_stack([df[c].values.astype(int) for c in ['transaction_date', 'amount']])
a
array([[1454284800000000000, 1],
       [1454371200000000000, 2],
       [1454457600000000000, 3],
       [1454544000000000000, 4],
       [1454630400000000000, 5]])
We can always convert the first column of a back like this
a[:, 0].astype(df.transaction_date.values.dtype)
array(['2016-02-01T00:00:00.000000000', '2016-02-02T00:00:00.000000000',
       '2016-02-03T00:00:00.000000000', '2016-02-04T00:00:00.000000000',
       '2016-02-05T00:00:00.000000000'], dtype='datetime64[ns]')
You can convert your integer into a timedelta and then do the calculations as you did before:
from datetime import timedelta

interval = timedelta(days=5)
# 5 days later
time_stamp += interval   # time_stamp being the pandas Timestamp you want to shift
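As a quick sketch of why that fixes the error (the starting Timestamp here is made up):

import pandas as pd
from datetime import timedelta

time_stamp = pd.Timestamp('2016-02-01')   # hypothetical starting point
# time_stamp + 5                          # this is what triggers "Cannot add integral value to Timestamp without freq."
time_stamp + timedelta(days=5)            # Timestamp('2016-02-06 00:00:00')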
I am using pandas to convert a column having date and time into seconds by using the following code:
df['date_time'] = pd.to_timedelta(df['date_time'])
df['date_time'] = df['date_time'].dt.total_seconds()
The dataset is:
If I use the following code:
df['date_time'] = pd.to_datetime(df['date_time'], errors='coerce')
df['date_time'] = df['date_time'].dt.total_seconds()
print(df.head())
Then I get the following error:
AttributeError: 'DatetimeProperties' object has no attribute 'total_seconds'
The same happens with dt.timestamp.
So my queries are:
Is it necessary to convert the time to seconds for training the model? If yes, then how; and if not, then why?
This one is related to two other columns named weather_m and weather_d. weather_m has 38 different categories, of which only one is true at a time, and weather_d has 11, but the situation is the same. So I am a bit confused here: should I split this categorical data, merge 49 new columns into the original dataset, and drop weather_m and weather_d to train the model, or should I use LabelEncoder instead of pd.get_dummies?
Converting a datetime or timestamp into a timedelta (a duration) doesn't make sense. It would only make sense if you want the duration between the given timestamp and some other reference date, in which case you can get the timedelta simply by using - to take the difference between the two dates.
Since your datetime column is a string you also need to convert it to a datetime first: df['date_time'] = pd.to_datetime(df['date_time'], format='%m/%d/%Y %H:%M').
Then you can try something like: ref_date = datetime.datetime(1970, 1, 1, 0, 0); df['secs_since_epoch'] = (df['date_time'] - ref_date).dt.total_seconds()
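Putting both steps together, a minimal sketch (the sample rows are invented; it assumes the strings really follow %m/%d/%Y %H:%M):

import datetime
import pandas as pd

df = pd.DataFrame({'date_time': ['01/01/2019 06:00', '01/01/2019 06:10']})  # invented sample rows
df['date_time'] = pd.to_datetime(df['date_time'], format='%m/%d/%Y %H:%M')

ref_date = datetime.datetime(1970, 1, 1, 0, 0)
df['secs_since_epoch'] = (df['date_time'] - ref_date).dt.total_seconds()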
If the different categories are totally distinct from each other (and they don't e.g. have an implicit ordering to them) then you should use one hot encoding yes, replacing the original columns. Since the number of categories is small that should be fine.
(Though it also depends on what exactly you're going to run on this data; some libraries might be fine with the original categorical column and do the conversion implicitly for you.)
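For the one hot encoding itself, pd.get_dummies can do the splitting and merging in one call. A sketch, with invented weather values:

import pandas as pd

df = pd.DataFrame({'weather_m': ['rain', 'snow', 'rain'],
                   'weather_d': ['clear', 'fog', 'clear']})  # hypothetical categories

# Replaces weather_m and weather_d with one 0/1 column per category
df = pd.get_dummies(df, columns=['weather_m', 'weather_d'])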
Currently I am working with a big dataframe (12x47800). One of the twelve columns consists of an integer number of seconds. I want to change this column to a datetime.time format. schedule is my dataframe, in which I am trying to change the column named 'depTime'. Since I want it to be a datetime.time and it could cross midnight, I added the if-statement. This 'works', but really slowly, as one could imagine. Is there a faster way to do this?
My current code, the only one I could get working is:
import datetime as dt

for i in range(len(schedule)):
    t_sec = schedule.iloc[i].depTime
    [t_min, t_sec] = divmod(t_sec, 60)
    [t_hour, t_min] = divmod(t_min, 60)
    if t_hour > 23:
        t_hour -= 23
    schedule['depTime'].iloc[i] = dt.time(int(t_hour), int(t_min), int(t_sec))
Thanks in advance guys.
PS: I'm pretty new to Python, so if anybody could help me I would be very grateful :)
I'm adding a new solution which is much faster than the original, since it relies on pandas vectorized functions instead of looping (pandas apply is essentially a Python-level loop over the data).
I tested it with a sample similar in size to yours and the difference is from 778ms to 21.3ms. So I definitely recommend the new version.
Both solutions are based on transforming your integer seconds into timedeltas and adding them to a reference datetime. Then, I simply capture the time component of the resulting datetimes.
New (Faster) Option:
import datetime as dt
import numpy as np
import pandas as pd

seconds = pd.Series(np.random.rand(50) * 100).astype(int)   # Generating test data
start = dt.datetime(2019, 1, 1, 0, 0)                        # You need a reference point
datetime_series = seconds.astype('timedelta64[s]') + start   # vectorized: seconds -> timedelta, shifted by start
time_series = datetime_series.dt.time
time_series
Original (slower) Answer:
Not the most elegant solution, but it does the trick.
import datetime as dt
seconds = pd.Series(np.random.rand(50)*100).astype(int) # Generating test data
start = dt.datetime(2019,1,1,0,0) # You need a reference point
time_series = seconds.apply(lambda x: start + pd.Timedelta(seconds=x)).dt.time
You should try not to loop over a dataframe row by row; instead, use vectorized operations, because they are normally much more efficient.
Fortunately, pandas has a function that does exactly what you are asking for, to_timedelta:
schedule['depTime'] = pd.to_timedelta(schedule['depTime'], unit='s')
It is not really a datetime format, but it is the pandas equivalent of a datetime.timedelta and a convenient type for processing times. You could use to_datetime instead, but you would end up with full datetimes close to 1970-01-01...
If you really need datetime.time objects, you can get them that way:
schedule['depTime'] = pd.to_datetime(schedule['depTime'], unit='s').dt.time
but they are less convenient to use in a pandas dataframe.
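For example, on a few made-up depTime values you can see the difference between the two options:

import pandas as pd

schedule = pd.DataFrame({'depTime': [34200, 61200, 90000]})  # hypothetical seconds values

pd.to_timedelta(schedule['depTime'], unit='s')         # 09:30:00, 17:00:00, 1 days 01:00:00
pd.to_datetime(schedule['depTime'], unit='s').dt.time  # 09:30:00, 17:00:00, 01:00:00 (wraps past midnight)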
I was having trouble manipulating time-series data provided to me for a project. The data contains the number of flight bookings made on a website per second over a duration of 30 minutes. Here is a part of the column containing the timestamps:
>>> df['Date_time']
0 7/14/2017 2:14:14 PM
1 7/14/2017 2:14:37 PM
2 7/14/2017 2:14:38 PM
I wanted to do
>>> pd.set_index('Date_time')
and use the datetime and timedelta methods provided by pandas to generate the timestamp to be used as index to access and modify any value in any cell.
Something like
>>> from datetime import datetime
>>> import datetime as dt
>>> td = datetime(year=2017, month=7, day=14, hour=2, minute=14, second=36)
>>> td1 = dt.timedelta(minutes=1, seconds=58)
>>> ti1 = td1 + td
>>> df.at[ti1, 'column_name'] = 65000
But the timestamp generated is of the form
>>> print(ti1)
2017-07-14 02:16:34
This, as can be clearly seen, cannot be used directly as an index in my case. Is there a workaround for the above without writing additional methods myself?
I want to do the above because it gives me a greater level of control over the data than looking up the default numerical index of each row I want to update, and hence, in my opinion, it will prove more efficient.
Can you check the dtype of the 'Date_time' column and confirm for me that it is string (object)?
df.dtypes
If so, you should be able to cast the values to pd.Timestamp by using the following.
df['timestamp'] = df['Date_time'].apply(pd.Timestamp)
When we call .dtypes now, we should have a 'timestamp' field of type datetime64[ns], which allows us to use builtin pandas methods more easily.
I would suggest it is prudent to index the dataframe by the timestamp too, achieved by setting the index equal to that column.
df.set_index('timestamp', inplace=True)
We should now be able to use some more useful methods such as
df.loc[timestamp_to_check, :]
df.loc[start_time_stamp : end_timestamp, : ]
df.asof(timestamp_to_check)
to lookup values from the DataFrame based upon passing a datetime.datetime / pd.Timestamp / np.datetime64 into the above. Note that you will need to cast any string (object) 'lookups' to one of the above types in order to make use of the above correctly.
I prefer to use pd.Timestamp() - https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Timestamp.html to handle datetime conversion from strings unless I am explicitly certain of what format the datetime string is always going to be in.
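A short sketch of the whole flow on data shaped like yours (the 'bookings' column and its values are invented):

import pandas as pd

df = pd.DataFrame({'Date_time': ['7/14/2017 2:14:14 PM', '7/14/2017 2:14:37 PM', '7/14/2017 2:14:38 PM'],
                   'bookings': [10, 12, 9]})
df['timestamp'] = df['Date_time'].apply(pd.Timestamp)
df.set_index('timestamp', inplace=True)

ti1 = pd.Timestamp('2017-07-14 14:14:37')   # the instant you built via timedelta arithmetic
df.loc[ti1, 'bookings'] = 65000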
I have a list of file creation times obtained using os.path.getmtime:
time_creation_sorted
Out[45]:
array([ 1.47133334e+09,  1.47133437e+09,  1.47133494e+09,
        1.47133520e+09,  1.47133577e+09,  1.47133615e+09,
        1.47133617e+09,  1.47133625e+09,  1.47133647e+09])
I know how to convert those elements in hour minute seconds.
datetime.fromtimestamp(time_creation_sorted[1]).strftime('%H:%M:%S')
Out[62]: '09:59:26'
What I would like to do is to create another table that contains the time elapsed since the first element but expressed in hour:min:sec such that it would look like:
array(['00:00:00','00:16:36',...])
But I have not managed to find out how to do that. Naively taking the difference between the elements of time_creation_sorted and trying to convert it to hour:min:sec does not give something logical:
datetime.fromtimestamp(time_creation_sorted[1]-time_creation_sorted[0]).strftime('%H:%M:%S')
Out[67]: '01:17:02'
Any idea or link on how to do that?
Thanks,
Grégory
You need to rearrange some parts of your code in order to get the desired output.
First you should convert the timestamps to datetime objects; their differences result in so-called timedelta objects. The __str__() representation of those timedelta objects is exactly what you want:
from datetime import datetime
tstamps = [1.47133334e+09, 1.47133437e+09, 1.47133494e+09, 1.47133520e+09, 1.47133577e+09, 1.47133615e+09, 1.47133617e+09, 1.47133625e+09, 1.47133647e+09]
tstamps = [datetime.fromtimestamp(stamp) for stamp in tstamps]
tstamps_relative = [(t - tstamps[0]).__str__() for t in tstamps]
print(tstamps_relative)
giving:
['0:00:00', '0:17:10', '0:26:40', '0:31:00', '0:40:30', '0:46:50', '0:47:10', '0:48:30', '0:52:10']
Check out timedelta objects; they give the difference between two dates or times:
https://docs.python.org/2/library/datetime.html#timedelta-objects
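For instance, with two of the timestamps from the question (a minimal sketch):

from datetime import datetime

t0 = datetime.fromtimestamp(1.47133334e+09)
t1 = datetime.fromtimestamp(1.47133437e+09)
print(t1 - t0)   # 0:17:10 -- a timedelta prints as hours:minutes:seconds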
I have a pandas Series which can be constructed like the following:
given_time = datetime(2013, 10, 8, 0, 0, 33, 945109,
tzinfo=psycopg2.tz.FixedOffsetTimezone(offset=60, name=None))
given_times = np.array([given_time] * 3, dtype='datetime64[ns]')
column = pd.Series(given_times)
The dtype of my Series is datetime64[ns]
However, when I access it: column[1], somehow it becomes of type pandas.tslib.Timestamp, while column.values[1] stays np.datetime64. Does Pandas auto cast my datetime into Timestamp when accessing the item? Is it slow?
Do I need to worry about the difference in types? As far as I can see, the Timestamp seems not to have a timezone (numpy.datetime64('2013-10-08T00:00:33.945109000+0100') -> Timestamp('2013-10-07 23:00:33.945109', tz=None)).
In practice, I would do datetime arithmetic such as taking differences and comparing to a timedelta. Does the possible type inconsistency around my operators affect my use case at all?
Besides, am I encouraged to use pd.to_datetime instead of astype(dtype='datetime64') when converting datetime objects?
Pandas time types are built on top of numpy's datetime64.
In order to continue using the pandas operators, you should keep using pd.to_datetime rather than astype(dtype='datetime64'). This is especially true since you'll be taking datetime deltas, which pandas handles admirably, for example with resampling and period definitions.
http://pandas.pydata.org/pandas-docs/stable/timeseries.html#up-and-downsampling
http://pandas.pydata.org/pandas-docs/stable/timeseries.html#period
Though I haven't measured it, since the pandas times wrap numpy times, I suspect the conversion is quite fast. Alternatively, you can just use pandas' built-in time series definitions and avoid the conversion altogether.
As a rule of thumb, it's good to use the type from the package you'll be using functions from, though. So if you're really only going to use numpy to manipulate the arrays, then stick with numpy date time. Pandas methods => pandas date time.
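A minimal sketch of that recommendation (the sample strings are made up):

import pandas as pd

raw = pd.Series(['2013-10-08 00:00:33', '2013-10-09 00:00:33'])
ts = pd.to_datetime(raw)   # datetime64[ns] Series, ready for resampling, periods, etc.
ts - ts.shift()            # differences come back as timedelta64, as expected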
I had read in the documentation somewhere (apologies, can't find link) that scalar values will be converted to timestamps while arrays will keep their data type. For example:
from datetime import date
import pandas as pd
time_series = pd.Series([date(2010 + x, 1, 1) for x in range(5)])
time_series = time_series.apply(pd.to_datetime)
so that:
In[1]:time_series
Out[1]:
0 2010-01-01
1 2011-01-01
2 2012-01-01
3 2013-01-01
4 2014-01-01
dtype: datetime64[ns]
and yet:
In[2]:time_series.iloc[0]
Out[2]:Timestamp('2010-01-01 00:00:00')
while:
In[3]:time_series.values[0]
Out[3]:numpy.datetime64('2009-12-31T19:00:00.000000000-0500')
because iloc requests a scalar from pandas (type conversion to Timestamp) while values requests the full numpy array (no type conversion).
There is similar behavior for a series of length one. Additionally, referencing more than one element in a slice (i.e. iloc[1:10]) will return a series, which will always keep its datatype.
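Continuing the example above, a quick sketch of that scalar-versus-slice behavior:

type(time_series.iloc[0])     # pandas Timestamp -- scalar access converts
time_series.iloc[1:3].dtype   # datetime64[ns] -- a slice keeps the Series dtype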
I'm unsure as to why pandas behaves this way.
In[4]: pd.__version__
Out[4]: '0.15.2'