I'm trying to obtain day deltas for a wide range of pandas dates. However, for time deltas greater than 292 years I get negative values. For example,
import pandas as pd
dates = pd.Series(pd.date_range('1700-01-01', periods=4500, freq='M'))
days_delta = (dates-dates.min()).astype('timedelta64[D]')
However, using a DatetimeIndex I can do it, and it works as I want it to:
import pandas as pd
import numpy as np
dates = pd.date_range('1700-01-01', periods=4500, freq='M')
days_fun = np.vectorize(lambda x: x.days)
days_delta = days_fun(dates.date - dates.date.min())
The question then is how to obtain the correct days_delta for Series objects?
The docs say the following about timedelta limitations:
Pandas represents Timedeltas in nanosecond resolution using 64 bit integers. As such, the 64 bit integer limits determine the Timedelta limits.
Incidentally, this is the same limitation the docs mention for Timestamps in pandas:
Since pandas represents timestamps in nanosecond resolution, the timespan that can be represented using a 64-bit integer is limited to approximately 584 years
This would suggest that the same recommendations the docs make for circumventing the timestamp limitations can be applied to timedeltas. The solution to the timestamp limitation is found in the docs (here):
If you have data that is outside of the Timestamp bounds, see Timestamp limitations, then you can use a PeriodIndex and/or Series of Periods to do computations.
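Following that advice for the original example, one option is to drop to daily Periods and work with their integer ordinals, so the subtraction never goes through nanosecond timedeltas. This is a minimal sketch, not the only way to do it:
import pandas as pd
dates = pd.Series(pd.date_range('1700-01-01', periods=4500, freq='M'))
# Convert each timestamp to a daily Period and take its integer ordinal
# (a plain int64 day count); differences of plain ints are not subject
# to the 64-bit-nanosecond Timedelta limit.
ordinals = dates.dt.to_period('D').apply(lambda p: p.ordinal)
days_delta = ordinals - ordinals.min()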
Workaround
If you have continuous dates with small, representable gaps, as in your example, you can sort the series and then use cumsum to get around this problem, like this:
import pandas as pd
dates = pd.Series(pd.date_range('1700-01-01', periods=4500, freq='M'))
dates = dates.sort_values()
dateshift = dates.shift(1)
(dates - dateshift).fillna(pd.Timedelta(0)).dt.days.cumsum().describe()
count 4500.000000
mean 68466.072444
std 39543.094524
min 0.000000
25% 34233.250000
50% 68465.500000
75% 102699.500000
max 136935.000000
dtype: float64
Note that the min and the max are both positive.
Failaround
If the gaps are too big, this workaround will not work. Like here:
dates = pd.Series(pd.to_datetime(['2016-06-06', '1700-01-01', '2200-01-01']))
dates = dates.sort_values()
dateshift = dates.shift(1)
(dates - dateshift).fillna(pd.Timedelta(0)).dt.days.cumsum()
1 0
0 -97931
2 -30883
This is because we calculate the step between each consecutive pair of dates and then add the steps up. Sorting guarantees the smallest possible steps, but in this case even a single step is too big to handle.
Resetting the order
As you see in the Failaround example, the series is no longer ordered by its index after sorting by value. Since sort_values() keeps the original index labels, you can restore the original order by calling .sort_index() on the result, or call .reset_index(drop=True) if you just want a fresh integer index.
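For example, picking up the workaround above (a small sketch):
days_delta = (dates - dates.shift(1)).fillna(pd.Timedelta(0)).dt.days.cumsum()
days_delta = days_delta.sort_index()  # sort_values() kept the original index labels, so this restores the original row order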
Related
I would like to create a column in a pandas data frame that is an integer representation of the number of days in a timedelta column. Is it possible to use 'datetime.days' or do I need to do something more manual?
timedelta column: 7 days, 23:29:00
day integer column: 7
The Series class has a pandas.Series.dt accessor object with several
useful datetime attributes, including dt.days. Access this attribute via:
timedelta_series.dt.days
You can also get the seconds and microseconds attributes in the same way.
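For example (a quick illustration with made-up values):
import pandas as pd
td = pd.Series(pd.to_timedelta(['7 days 23:29:00', '1 days 02:30:00']))
td.dt.days     # 7, 1 -- the whole-day part
td.dt.seconds  # 84540, 9000 -- the seconds part within each day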
You could do this, where td is your series of timedeltas. The division converts the nanosecond deltas into day deltas, and the conversion to int truncates to whole days.
import numpy as np
(td / np.timedelta64(1, 'D')).astype(int)
Timedelta objects have read-only instance attributes .days, .seconds, and .microseconds.
If the question isn't just "how do I access an integer form of the timedelta?" but "how do I convert the timedelta column in the dataframe to an int?", the answer is a little different. In addition to the .dt.days accessor you need either df.astype or pd.to_numeric.
Either of these options should help:
df['tdColumn'] = pd.to_numeric(df['tdColumn'].dt.days, downcast='integer')
or
df['tdColumn'] = df['tdColumn'].dt.days.astype('int16')
The simplest way to do this (provided the column already holds timedeltas) is
df["DateColumn"] = df["DateColumn"].dt.days
A great way to do this is
dif_in_days = dif.days
(where dif is a single timedelta, e.g. the difference between two dates)
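For instance (a toy example):
from datetime import date
dif = date(2024, 3, 15) - date(2024, 3, 8)  # a timedelta of 7 days
dif_in_days = dif.days                      # 7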
I have a data frame consisting of hourly wind speed measurements for the year 2012 for different locations as seen below:
I was able to change the index to datetime format for just one location using the code:
import pandas as pd
dfn = df1_s.reset_index(drop=True)
dfn['datetime'] = pd.to_datetime(dfn.index,
                                 origin=pd.Timestamp('2012-01-01 00:00:00'),
                                 unit='H')
When using the same code over the entire dataframe, I obtain the error: cannot convert input with unit 'h'. This is probably because there is far more data than can be represented in hours, but I am not sure. Nevertheless, it works when I use minutes, i.e. unit='m'.
What I want to do is set the datetime so that it repeats after every 8784 hours, i.e. the same replicated datetime range for each location in the same dataframe, as seen in the image below (expected results produced in Excel).
When trying the following, all I obtained was a column of NaNs:
import random
import pandas as pd
dates = pd.date_range('2012-01-01', '2013-01-01', freq='H')
data = [int(1000 * random.random()) for i in range(len(dates))]
dfn['cum_data'] = pd.Series(data, index=dates)
Can you please direct me on how to go about this?
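One way to get a repeating hourly datetime column per location is to build one block of hourly stamps and tile it. This is only a sketch, under the assumption that the rows of dfn are stacked location by location and each block covers exactly the 8784 hours of 2012:
import numpy as np
import pandas as pd
hours = pd.date_range('2012-01-01 00:00:00', periods=8784, freq='H')  # 2012 is a leap year
n_locations = len(dfn) // len(hours)  # assumes len(dfn) is an exact multiple of 8784
dfn['datetime'] = np.tile(hours, n_locations)
As for the NaNs in the last attempt: pd.Series(data, index=dates) is aligned on dfn's existing integer index when assigned, and since those labels don't match the datetime labels you get NaNs; assigning the raw values instead (e.g. dfn['cum_data'] = pd.Series(data).values, provided the lengths match) sidesteps the alignment.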
I have two time series that give the electricity demand in one-hour resolution and five-minute resolution. I am trying to find the maximum difference between these two time series. The one-hour resolution data has 8,760 rows (hourly for a year) and the 5-minute resolution data has 104,722 rows (5-minutely for a year).
The only method I can think of is to expand the hourly data to 5-minute resolution, repeating each hourly value 12 times, and then take the maximum of the difference between the two data sets.
If this technique is the way to go, is there an easy way to convert my hourly data into 5-minute resolution by repeating the hourly data 12 times?
For your reference, I posted a plot of this data for one day.
P.S. I am using Python for this task.
Numpy's .repeat() function
You can change your hourly data into 5-minute data by using numpy's repeat function
import numpy as np
np.repeat(hourly_data, 12)
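A hedged sketch of the full expand-and-compare step (hourly_data and five_min_data are placeholder names for your two series; the slice just guards against the small length mismatch you mention):
import numpy as np
expanded = np.repeat(np.asarray(hourly_data), 12)  # each hourly value repeated 12 times
n = min(len(expanded), len(five_min_data))         # the two series do not line up exactly in length
max_diff = np.abs(np.asarray(five_min_data)[:n] - expanded[:n]).max()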
I would strongly recommend against converting the hourly data into five-minute data. If the data in both cases refers to the mean load over those time ranges, you will get more accurate results by grouping the five-minute intervals into hourly data sets. You would get more granularity the way you describe, but that granularity is not based on real measurements, so it does not actually add value. If you aggregate the five-minute chunks into hourly chunks and compare the series that way, you can be more confident in the trustworthiness of your results.
In order to group them together to get that result, you can define a function like the following and use the apply method like so (the 'Real-Time Load' column name in the aggregation below is assumed):
from datetime import datetime as dt

def to_hour(date):
    # Truncate a timestamp to the start of its hour via a string round-trip
    date = date.strftime("%Y-%m-%d %H:00:00")
    date = dt.strptime(date, "%Y-%m-%d %H:%M:%S")
    return date

df['Aggregated_Datetime'] = df['Original_Datetime'].apply(lambda x: to_hour(x))
df.groupby('Aggregated_Datetime').agg({'Real-Time Load': 'mean'})  # 'Real-Time Load' is an assumed column name
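A more idiomatic sketch of the same hourly grouping, using the dt accessor's floor method instead of the strftime/strptime round-trip (column names follow the assumed ones above, and 'Original_Datetime' is assumed to already be a datetime64 column):
df['Aggregated_Datetime'] = df['Original_Datetime'].dt.floor('H')
hourly = df.groupby('Aggregated_Datetime')['Real-Time Load'].mean()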
I'm trying to add a column to a data frame that indicates the time difference between each row's index and a fixed timestamp. The data frame consists of a datetime index and some string columns.
I use
d["diff"] = d.index-t0
to calculate said time difference. Due to prior filtering, the biggest possible diff value should be between 10 and 20s. However, I frequently get diffs slightly under a day (1-10s less), even though the actual difference is something like 5s.
I read that a prior version of pandas had issues with exactly this, but it was said to be long fixed.
My workaround would be to copy the index, cast it to int64, cast t0 to int64, subtract t0 from all rows and then convert the diff column back to timedeltas, but that seems extremely inefficient and ugly.
PS: It happens on OS X and Debian 8 both using pandas 0.16.0.
EDIT: As requested, one sample:
2013-12-12 13:50:48 # t0
timestamp
2013-12-16 13:50:52 4 days 00:00:04
Name: diff, dtype: timedelta64[ns]
And I just noticed that the date is totally off; I used indexer_between_time() to get the indices and only looked at the time, not the date. This is even more confusing.
indices = df.index.indexer_between_time(start_time=index, end_time=index + DateOffset(seconds=t_offset))
So the eventual cause of this was that you were using between_time to find times in your desired range. Unfortunately, between_time doesn't actually find times in a range; it finds times matching the same hours of the day, regardless of the day (I have definitely made the same mistake before). To find just the times in a specific range, you can just do:
end_time = index + DateOffset(seconds=t_offset)
indices = df.loc[index:end_time].index
This works as long as your DatetimeIndex is monotonic (sorted); if not, you may want to sort it first.
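To see the original pitfall concretely (a tiny made-up index):
import pandas as pd
idx = pd.DatetimeIndex(['2013-12-12 13:50:50',
                        '2013-12-13 13:50:50',
                        '2013-12-16 13:50:52'])
# All three positions come back, because only the wall-clock time is compared:
idx.indexer_between_time('13:50:48', '13:50:55')  # array([0, 1, 2])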
I'm trying to calculate daily sums of values using pandas. Here's the test file - http://pastebin.com/uSDfVkTS
This is the code I came up with so far:
import numpy as np
import datetime as dt
import pandas as pd
f = np.genfromtxt('test', dtype=[('datetime', '|S16'), ('data', '<i4')], delimiter=',')
dates = [dt.datetime.strptime(i, '%Y-%m-%d %H:%M') for i in f['datetime']]
s = pd.Series(f['data'], index = dates)
d = s.resample('D', how='sum')
Using the given test file this produces:
2012-01-02 1128
Freq: D
The first problem is that the calculated sum corresponds to the next day. I've been able to solve that by using the parameter loffset='-1d'.
Now the actual problem is that the data may not start at 00:30 of a day but at any time of the day. Also, the data has gaps filled with 'nan' values.
That said, is it possible to set a lower threshold on the number of values necessary to calculate daily sums? (e.g. if there are fewer than 40 values in a single day, put NaN instead of a sum)
I believe it is possible to define a custom function to do that and refer to it in the 'how' parameter, but I have no clue how to code the function itself.
You can do it directly in Pandas:
import numpy as np
import pandas as pd

s = pd.read_csv('test', header=None, index_col=0, parse_dates=True)
d = s.groupby(lambda x: x.date()).aggregate(lambda x: sum(x) if len(x) >= 40 else np.nan)
X.2
2012-01-01 1128
A much easier way is to use pd.Grouper:
d = s.groupby(pd.Grouper(freq='1D')).sum()
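And if you still need the "at least 40 values per day" rule from the question, you can combine pd.Grouper with a custom aggregation (a sketch, reusing the s read in above):
import numpy as np
d = s.groupby(pd.Grouper(freq='1D')).agg(lambda x: x.sum() if x.count() >= 40 else np.nan)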