I'm trying to add a column to a data frame that indicates the time difference between each row's index and a fixed timestamp. The data frame consists of a DatetimeIndex and some string columns.
I use
d["diff"] = d.index-t0
to calculate said time difference. Due to prior filtering, the biggest possible diff value should be between 10 and 20 s. However, I frequently get diffs just short of a full day (1-10 s less than 24 h), even though the actual difference is something like 5 s.
I read that a prior version of pandas had issues with exactly this, but that was said to have been fixed long ago.
My workaround would be to copy the index, cast it to int64, cast t0 to int64, subtract t0 from all rows, and then convert the diff column back to timedeltas, but that seems extremely inefficient and ugly.
PS: It happens on OS X and Debian 8 both using pandas 0.16.0.
EDIT: As requested, one sample:
2013-12-12 13:50:48 # t0
timestamp
2013-12-16 13:50:52 4 days 00:00:04
Name: diff, dtype: timedelta64[ns]
And I just noticed that the date is totally off. I used indexer_between_time() to get the indices and only looked at the time, not the date. This is even more confusing.
indices = df.index.indexer_between_time(start_time=index, end_time=index + DateOffset(seconds=t_offset))
So the eventual cause of this was that you were using indexer_between_time to find times in your desired range. Unfortunately, between_time doesn't actually find timestamps within an interval; it matches times of day, regardless of the date (I have definitely made the same mistake before). To find just the times in a specific range, you can simply do:
end_time = index + DateOffset(seconds=t_offset)
df.loc[index:end_time]
This works as long as your DatetimeIndex is monotonic/sorted; if not, you may want to sort first.
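To illustrate the difference, here is a minimal, self-contained sketch (the frame, timestamps, and offset below are made up): between_time matches a time of day across every date, while label slicing keeps only the rows inside the actual interval.

import pandas as pd

idx = pd.date_range('2013-12-12 13:50:48', periods=5, freq='D')
df = pd.DataFrame({'val': range(5)}, index=idx)

t0 = pd.Timestamp('2013-12-12 13:50:48')
end_time = t0 + pd.DateOffset(seconds=10)

# Matches the time of day on every date: all 5 rows come back
print(df.between_time('13:50:48', '13:50:58'))

# Label slicing keeps only rows inside the actual interval: just the first row
print(df.loc[t0:end_time])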
Related
Background: sometimes we need a date that is a month after the original timestamp, and since not all days are trading days, some adjustment must be made.
I extracted the index of the stock close price, getting a time series with lots of timestamps of trading days.
trading_day_glossory = stock_close_full.index
Now, given a datetime-format variable date, the following function should return a day variable indicating a trading day. But it did not: the if condition is never triggered, so day kept increasing until it ran past the year 9999 and raised an error.
def return_trading_day(day, trading_day_glossory):
    while True:
        day = day + relativedelta(days=1)
        if day in trading_day_glossory:
            break
    return day
I reckon that comparing a timestamp with a datetime is problematic, so I rewrote the first part of my function in this way:
trading_day_glossory = stock_close_full.index
trading_day_glossory = trading_day_glossory.to_pydatetime()
# Error message: OverflowError: date value out of range
However this change makes no difference. I further tested some characteristics of the variables involved:
testing1 = trading_day_glossory[20] # returns a datetime variable say 2000-05-08 00:00:00
testing2 = day # returns a datetime variable say 2000-05-07 00:00:00
What may be the problem and what should I do?
Thanks.
I'm not quite sure what is going on, because the error can't be reproduced from the code and variables you posted.
However, you can use searchsorted, which binary-searches a sorted time series for the first timestamp not earlier than a given date:
trading_day_glossory.searchsorted(day)
It's far more efficient than stepping through dates one day at a time in a while loop.
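For example, here is a minimal sketch (the trading calendar below is made up and stands in for stock_close_full.index):

import pandas as pd
from datetime import datetime

trading_day_glossory = pd.DatetimeIndex(['2000-05-02', '2000-05-03', '2000-05-08'])

day = datetime(2000, 5, 7)
pos = trading_day_glossory.searchsorted(day)   # position of the first timestamp >= day
next_trading_day = trading_day_glossory[pos]
print(next_trading_day)                        # 2000-05-08 00:00:00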
I have a dataframe like this:
Then I obtain the hours, minutes, and seconds as:
train_data['time'] = train_data.index.strftime('%H:%M:%S')
which results in the last column. Then I would like to get the total seconds by multiplying the hours by 3600, the minutes by 60, and adding all three. However, I have no clue how to do this for all rows at once without using a loop.
If I access the values individually, as train_data['time'].values[0], each behaves as a string and it is clear how to proceed; however, if I try to operate on the values in any other way I get errors.
Use to_timedelta to convert the strings to timedeltas, then total_seconds:
train_data['time'] = pd.to_timedelta(train_data.index.strftime('%H:%M:%S')).total_seconds()
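For example, a self-contained sketch (the two timestamps below are made up and stand in for train_data.index):

import pandas as pd

idx = pd.to_datetime(['2019-01-01 01:10:34', '2019-01-01 00:05:12'])
seconds = pd.to_timedelta(idx.strftime('%H:%M:%S')).total_seconds()
print(list(seconds))  # [4234.0, 312.0]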
You can use a combination of sum, map, zip, str.split and int:
t = "01:10:34"
as_seconds = sum( f*t for f,t in zip((3600,60,1),map(int,t.split(":"))))
print(as_seconds) # 4234
and apply this to all your data.
The wiser choice, though, would be to use a timedelta and its .total_seconds(), as pointed out by jezrael.
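To apply the string-splitting approach to a whole column, a map over the Series works; here is a sketch (the sample values and the hms_to_seconds helper name are mine):

import pandas as pd

def hms_to_seconds(t):
    # '01:10:34' -> 1*3600 + 10*60 + 34 = 4234
    return sum(factor * part for factor, part in zip((3600, 60, 1), map(int, t.split(':'))))

times = pd.Series(['01:10:34', '00:05:12'])  # stand-in for train_data['time']
print(times.map(hms_to_seconds).tolist())    # [4234, 312]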
Currently I am working with a big dataframe (12x47800). One of the twelve columns consists of an integer number of seconds, and I want to change it to a datetime.time format. schedule is my dataframe, and I am trying to change the column named 'depTime'. Since I want a datetime.time and it could cross midnight, I added the if-statement. This 'works', but really slowly, as one could imagine. Is there a faster way to do this?
My current code, the only one I could get working is:
for i in range(len(schedule)):
    t_sec = schedule.iloc[i].depTime
    t_min, t_sec = divmod(t_sec, 60)
    t_hour, t_min = divmod(t_min, 60)
    if t_hour > 23:
        t_hour -= 23
    schedule['depTime'].iloc[i] = dt.time(int(t_hour), int(t_min), int(t_sec))
Thanks in advance, guys.
PS: I'm pretty new to Python, so if anybody could help me I would be very grateful :)
I'm adding a new solution which is much faster than the original, since it relies on pandas vectorized operations instead of looping (pandas apply is essentially a loop over the data).
I tested it with a sample similar in size to yours and the difference is from 778ms to 21.3ms. So I definitely recommend the new version.
Both solutions are based on transforming your seconds integers into timedelta format and adding it to a reference datetime. Then, I simply capture the time component of the resulting datetimes.
New (Faster) Option:
import datetime as dt
import numpy as np
import pandas as pd

seconds = pd.Series(np.random.rand(50) * 100).astype(int)  # Generating test data
start = dt.datetime(2019, 1, 1, 0, 0)  # You need a reference point
datetime_series = seconds.astype('timedelta64[s]') + start
time_series = datetime_series.dt.time
time_series
Original (slower) Answer:
Not the most elegant solution, but it does the trick.
import datetime as dt
import numpy as np
import pandas as pd

seconds = pd.Series(np.random.rand(50) * 100).astype(int)  # Generating test data
start = dt.datetime(2019, 1, 1, 0, 0)  # You need a reference point
time_series = seconds.apply(lambda x: start + pd.Timedelta(seconds=x)).dt.time
You should avoid looping over a dataframe row by row and instead use vectorized operations, because they are normally much more efficient.
Fortunately, pandas has a function that does exactly what you are asking for, to_timedelta:
schedule['depTime'] = pd.to_timedelta(schedule['depTime'], unit='s')
It is not really a datetime format, but it is the pandas equivalent of a datetime.timedelta and a convenient type for processing times. You could use to_datetime, but you would end up with a full datetime close to 1970-01-01...
If you really need datetime.time objects, you can get them that way:
schedule['depTime'] = pd.to_datetime(schedule['depTime'], unit='s').dt.time
but they are less convenient to use in a pandas dataframe.
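For instance, a small sketch with made-up departure times (the arrTime column is purely illustrative) showing why the timedelta form stays convenient for arithmetic:

import pandas as pd

schedule = pd.DataFrame({'depTime': [6 * 3600 + 15 * 60, 23 * 3600 + 50 * 60]})  # seconds
schedule['depTime'] = pd.to_timedelta(schedule['depTime'], unit='s')

# Timedeltas still support arithmetic, unlike datetime.time objects
schedule['arrTime'] = schedule['depTime'] + pd.Timedelta(minutes=45)
print(schedule)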
I am trying to obtain day deltas for a wide range of pandas dates. However, for time deltas greater than about 292 years I obtain negative values. For example,
import pandas as pd
dates = pd.Series(pd.date_range('1700-01-01', periods=4500, freq='m'))
days_delta = (dates-dates.min()).astype('timedelta64[D]')
However, using a DatetimeIndex I can do it and it works as I want it to,
import pandas as pd
import numpy as np
dates = pd.date_range('1700-01-01', periods=4500, freq='m')
days_fun = np.vectorize(lambda x: x.days)
days_delta = days_fun(dates.date - dates.date.min())
The question then is how to obtain the correct days_delta for Series objects?
Read here specifically about timedelta limitations:
Pandas represents Timedeltas in nanosecond resolution using 64 bit integers. As such, the 64 bit integer limits determine the Timedelta limits.
Incidentally this is the same limitation the docs mentioned that is placed on Timestamps in Pandas:
Since pandas represents timestamps in nanosecond resolution, the timespan that can be represented using a 64-bit integer is limited to approximately 584 years
This would suggest that the same recommendations the docs make for circumventing the timestamp limitations can be applied to timedeltas. The solution to the timestamp limitations is found in the docs (here):
If you have data that is outside of the Timestamp bounds, see Timestamp limitations, then you can use a PeriodIndex and/or Series of Periods to do computations.
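As an aside, before the workaround below, here is a sketch of one other way around the nanosecond limit (not the Period route from the docs): do the subtraction at day resolution in numpy, where timedelta64[D] easily covers this span.

import numpy as np
import pandas as pd

dates = pd.Series(pd.date_range('1700-01-01', periods=4500, freq='M'))
d = dates.values.astype('datetime64[D]')                       # truncate to day resolution
days_delta = pd.Series((d - d.min()).astype(int), index=dates.index)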
Workaround
If your dates are continuous, with gaps small enough for each step to be representable, as in your example, you can sort the series and then use cumsum to get around this problem, like this:
import pandas as pd

dates = pd.Series(pd.date_range('1700-01-01', periods=4500, freq='m'))
dates = dates.sort_values()
dateshift = dates.shift(1)
(dates - dateshift).fillna(pd.Timedelta(0)).dt.days.cumsum().describe()
count 4500.000000
mean 68466.072444
std 39543.094524
min 0.000000
25% 34233.250000
50% 68465.500000
75% 102699.500000
max 136935.000000
dtype: float64
See the min and max are both positive.
Failaround
If the gaps are too big, this workaround will not work, like here:
dates = pd.Series(pd.to_datetime(['2016-06-06', '1700-01-01', '2200-01-01']))
dates = dates.sort_values()
dateshift = dates.shift(1)
(dates - dateshift).fillna(pd.Timedelta(0)).dt.days.cumsum()
1 0
0 -97931
2 -30883
This is because we calculate the step between consecutive dates and then add them up. Sorting guarantees the smallest possible steps, but here even a single step is too big to represent.
Resetting the order
As you can see in the Failaround example, the series is no longer ordered by its index after sorting by value. Fix this by calling .sort_index() on the result (or .reset_index(drop=True) if you want a fresh integer index).
I have a dataset with people's departure time for work and the time they take to get where they work. Since people generally go to work every weekday, there obviously is no need for a date associated with the data. I leave for work at 8 AM every working day, and return at 5 PM every working day.
Similarly for schools, offices, etc. There are a number of places where date does not matter as much as time. There is also the converse, where time does not matter as much as date. Back to my problem.
My time is coded as an epoch, and converting to datetime is pretty easy:
In [1]: df['time'] = pd.to_datetime(df['time'], unit='m')
df['time'].head(3)
Out[1]: 0 1970-01-01 06:15:00
1 1970-01-01 06:17:00
2 1970-01-01 08:10:00
Name: time, dtype: datetime64[ns]
But there is the pesky 1970-01-01 in there. I want to get rid of it:
In [2]: df['time'].dt.time.head(3)
Out[2]: 0 06:15:00
1 06:17:00
2 08:10:00
Name: time, dtype: object
Now it is converted into object, which is even peskier than having 1970-01-01, because I cannot do things like:
In [3]: df['time'].dt.time + pd.to_timedelta(df['travel'], unit='m')
Out[3]: ---------------------------------------------------------------------
TypeError Traceback (most recent call last)
< whole bunch of tracebacks. I know what's going on here >
TypeError: ufunc subtract cannot use operands with types dtype('O') and dtype('<m8[ns]')
Then there is this numpy page, with tons of examples, but every single one of them has a date component; none have only the time component. For example, I quote:
>>> np.array(['2007-07-13', '2006-01-13', '2010-08-13'], dtype='datetime64')
array(['2007-07-13', '2006-01-13', '2010-08-13'], dtype='datetime64[D]')
The story repeats in this Pandas page. There are numerous examples with only a date component, but not a single example with only a time component.
Why the lack of love for storing pure time in a manipulable format? Do I have to resort to converting all of my data into Python's native datetime.time type (which will kill me, because I have billions of rows to process)? What I am looking for is a way to store only the time component in a manipulable format. An answer that sheds light in that direction will be accepted.
Since @unutbu has not posted an answer to this question, only a comment, I shall post what worked and accept it as the answer. If @unutbu later posts an answer, I shall accept that instead.
Basically, as I mention in the question, the date component of the datetime does not matter for this task. Therefore, the simplest solution is to do the arithmetic first and extract the time afterwards:
(df['time'] + pd.to_timedelta(df['travel'], unit='m')).dt.time
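A self-contained sketch of that (the time and travel values below are made up to mirror the question's examples):

import pandas as pd

df = pd.DataFrame({'time': [375, 377, 490],   # minutes since midnight, as in the question
                   'travel': [20, 35, 15]})   # travel time in minutes
df['time'] = pd.to_datetime(df['time'], unit='m')

arrival = (df['time'] + pd.to_timedelta(df['travel'], unit='m')).dt.time
print(arrival)
# 0    06:35:00
# 1    06:52:00
# 2    08:25:00
# Name: time, dtype: object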