pandas datetime index unique difference - python

The following works for getting unique difference in consecutive datetime index.
# Data
import pandas
d = pandas.DataFrame({"a": [x for x in range(5)]})
d.index = pandas.date_range("2021-01-01 00:00:00", "2021-01-01 01:00:00", freq="15min")
# Get difference
delta = d.index.to_series().diff().astype("timedelta64[m]").unique()
delta
# array([nan, 15.])
But I am not clear where the nan comes from. I am only interested in the 15 minutes. Is delta[1] a reliable way to get it or am I missing something?

The first row doesn't have anything to diff against, so its NaT.
>>> d.index.to_series().diff()
2021-01-01 00:00:00 NaT
2021-01-01 00:15:00 00:15:00
2021-01-01 00:30:00 00:15:00
2021-01-01 00:45:00 00:15:00
2021-01-01 01:00:00 00:15:00
Freq: 15T, dtype: timedelta64[ns]
From pandas.Series.unique: Uniques are returned in order of appearance.. Since that NaT is guaranteed to be the first element in the returned list it is okay to do delta[1] as you suggest. Assuming you have at least 2 rows and you don't have NaT in the data.
More generally, if you don't want that first value in a diff, you can slice it off
>>> d.index.to_series().diff()[1:]
2021-01-01 00:15:00 00:15:00
2021-01-01 00:30:00 00:15:00
2021-01-01 00:45:00 00:15:00
2021-01-01 01:00:00 00:15:00
Freq: 15T, dtype: timedelta64[ns]

When you do diff , the first item will return NaN in pandas which is not same as R ~
d.index.to_series().diff()
Out[713]:
2021-01-01 00:00:00 NaT
2021-01-01 00:15:00 0 days 00:15:00
2021-01-01 00:30:00 0 days 00:15:00
2021-01-01 00:45:00 0 days 00:15:00
2021-01-01 01:00:00 0 days 00:15:00
Freq: 15T, dtype: timedelta64[ns]

Related

Is there a way how to interpolate on pandas Resampler object directly?

I have a DataFrame with irregular sampling frequency, therefore I would like to resample it and interpolate.
Lets say I have following data:
import pandas as pd
idx = pd.DatetimeIndex(["2021-01-01 00:01:35", "2021-01-01 00:05:01", "2021-01-01 00:08:42"])
df = pd.DataFrame({"a": [1, 2, 3]}, index=idx)
# a
# 2021-01-01 00:01:35 1
# 2021-01-01 00:05:01 2
# 2021-01-01 00:08:42 3
And I would like to get result similar to this one (interpolation using "index" method):
a
2021-01-01 00:02:00 1.121359
2021-01-01 00:03:00 1.412621
2021-01-01 00:04:00 1.703883
2021-01-01 00:05:00 1.995146
2021-01-01 00:06:00 2.266968
2021-01-01 00:07:00 2.538462
2021-01-01 00:08:00 2.809955
For that, I thought that something like df.resample("T").interpolate(method="index") could work but this does not work, I would need to put there some aggregation function, e.g. df.resample("T").mean().interpolate(method="index") but it does not result in a wanted solution.
I could do some workaround like this:
df_res = pd.concat([df, df.resample("T").asfreq()]).sort_index()
df_res = df_res[~df_res.index.duplicated()]
df_res = df_res.interpolate(method="index").dropna()
df_res
# a
# 2021-01-01 00:01:35 1.000000
# 2021-01-01 00:02:00 1.121359
# 2021-01-01 00:03:00 1.412621
# 2021-01-01 00:04:00 1.703883
# 2021-01-01 00:05:00 1.995146
# 2021-01-01 00:05:01 2.000000
# 2021-01-01 00:06:00 2.266968
# 2021-01-01 00:07:00 2.538462
# 2021-01-01 00:08:00 2.809955
# 2021-01-01 00:08:42 3.000000
And then remove the original 3 indexes or keep everything based on my preferences. But I'm wondering whether there is a better solution that could work directly by combining resample and interpolate methods?
There may be other ways to do this, but the base value of the original data is in seconds, so upsampling in seconds is the way to go. There is an interpolation method for resampling, so we will use that. This will result in a complemented data frame of 1 second units, and we will filter that data frame by seconds.
df.resamle('S').interpolate()
df.resample('S').interpolate().head()
a
2021-01-01 00:01:35 1.000000
2021-01-01 00:01:36 1.004854
2021-01-01 00:01:37 1.009709
2021-01-01 00:01:38 1.014563
2021-01-01 00:01:39 1.019417
query
df.resample('S').interpolate().query('index.dt.second == 0')
a
2021-01-01 00:02:00 1.121359
2021-01-01 00:03:00 1.412621
2021-01-01 00:04:00 1.703883
2021-01-01 00:05:00 1.995146
2021-01-01 00:06:00 2.266968
2021-01-01 00:07:00 2.538462
2021-01-01 00:08:00 2.809955

Mean between two datetimes; if NaN, get last non-NaN value

Yesterday I asked this question (with some good answers) which is very similar, but slightly different from the problem I'm presented with now. Say I have the following pd.DataFrame (dict):
eff_timestamp val id begin_timestamp end_timestamp
0 2021-01-01 00:00:00 -0.710230 1 2021-01-01 02:00:00 2021-01-01 05:30:00
1 2021-01-01 01:00:00 0.121464 1 2021-01-01 02:00:00 2021-01-01 05:30:00
2 2021-01-01 02:00:00 -0.156328 1 2021-01-01 02:00:00 2021-01-01 05:30:00
3 2021-01-01 03:00:00 0.788685 1 2021-01-01 02:00:00 2021-01-01 05:30:00
4 2021-01-01 04:00:00 0.505210 1 2021-01-01 02:00:00 2021-01-01 05:30:00
5 2021-01-01 05:00:00 -0.738344 1 2021-01-01 02:00:00 2021-01-01 05:30:00
6 2021-01-01 06:00:00 0.266910 1 2021-01-01 02:00:00 2021-01-01 05:30:00
7 2021-01-01 07:00:00 -0.587401 1 2021-01-01 02:00:00 2021-01-01 05:30:00
8 2021-01-02 00:00:00 -0.160692 2 2021-01-02 12:00:00 2021-01-02 15:30:00
9 2021-01-02 01:00:00 0.306354 2 2021-01-02 12:00:00 2021-01-02 15:30:00
10 2021-01-02 02:00:00 NaN 2 2021-01-02 12:00:00 2021-01-02 15:30:00
11 2021-01-02 03:00:00 NaN 2 2021-01-02 12:00:00 2021-01-02 15:30:00
12 2021-01-02 04:00:00 NaN 2 2021-01-02 12:00:00 2021-01-02 15:30:00
13 2021-01-02 05:00:00 NaN 2 2021-01-02 12:00:00 2021-01-02 15:30:00
14 2021-01-02 06:00:00 NaN 2 2021-01-02 12:00:00 2021-01-02 15:30:00
15 2021-01-02 07:00:00 -0.349705 2 2021-01-02 12:00:00 2021-01-02 15:30:00
I would like to get the mean value of val for each unique id, for those val's that lie between the begin_timestamp and end_timestamp. If there are no rows that satisfy that criteria, I'd like to get the last value for that id before that period. Note that in this example, id=2 has no rows that satisfy the criteria. Previously I could slice the data so I only keep the rows between the begin and end_timestamp, and then use a groupby. The solution from my previous post then replaces the NaN value in the groupby object. However, in the example above, id=2 has no rows at all that satisfy the criteria, and therefore there is no NaN value created that can be replaced. So if I slice the data based above on the criteria:
sliced = df[(df.eff_timestamp > df.begin_timestamp) & (df.eff_timestamp < df.end_timestamp)]
sliced
>>>
eff_timestamp val id begin_timestamp end_timestamp
3 2021-01-01 03:00:00 0.788685 1 2021-01-01 02:00:00 2021-01-01 05:30:00
4 2021-01-01 04:00:00 0.505210 1 2021-01-01 02:00:00 2021-01-01 05:30:00
5 2021-01-01 05:00:00 -0.738344 1 2021-01-01 02:00:00 2021-01-01 05:30:00
sliced.groupby('id').val.mean()
>>>
id
1 0.185184
Name: val, dtype: float64
This result only includes id=1 with the mean value, but there is no value for id=2. How would I, instead of the mean, include the last available value for id=2, which is -0.349705?
Create a temp column between_time. Then Groupby id column and then, in apply add the condition - > If for a particular id is there any value that lies within the range? If yes, take the mean else take the value present at last_valid_index.
result = (
df.assign(
between_time=(df.eff_timestamp > df.begin_timestamp) & (df.eff_timestamp < df.end_timestamp))
.groupby('id')
.apply(
lambda x: x.loc[x['between_time']]['val'].mean()
if any(x['between_time'].values)
else
x.loc[x['val'].last_valid_index()]['val']
)
)
OUTPUT:
id
1 0.185184
2 -0.349705
dtype: float64

Applying start and endtime as filters to dataframe

I'm working on a timeseries dataframe which looks like this and has data from January to August 2020.
Timestamp Value
2020-01-01 00:00:00 -68.95370
2020-01-01 00:05:00 -67.90175
2020-01-01 00:10:00 -67.45966
2020-01-01 00:15:00 -67.07624
2020-01-01 00:20:00 -67.30549
.....
2020-07-01 00:00:00 -65.34212
I'm trying to apply a filter on the previous dataframe using the columns start_time and end_time in the dataframe below:
start_time end_time
2020-01-12 16:15:00 2020-01-13 16:00:00
2020-01-26 16:00:00 2020-01-26 16:10:00
2020-04-12 16:00:00 2020-04-13 16:00:00
2020-04-20 16:00:00 2020-04-21 16:00:00
2020-05-02 16:00:00 2020-05-03 16:00:00
The output should assign all values which are not within the start and end time as zero and retain values for the start and end times specified in the filter. I tried applying two simultaneous filters for start and end time but didn't work.
Any help would be appreciated.
Idea is create all masks by Series.between in list comprehension, then join with logical_or by np.logical_or.reduce and last pass to Series.where:
print (df1)
Timestamp Value
0 2020-01-13 00:00:00 -68.95370 <- changed data for match
1 2020-01-01 00:05:00 -67.90175
2 2020-01-01 00:10:00 -67.45966
3 2020-01-01 00:15:00 -67.07624
4 2020-01-01 00:20:00 -67.30549
5 2020-07-01 00:00:00 -65.34212
L = [df1['Timestamp'].between(s, e) for s, e in df2[['start_time','end_time']].values]
m = np.logical_or.reduce(L)
df1['Value'] = df1['Value'].where(m, 0)
print (df1)
Timestamp Value
0 2020-01-13 00:00:00 -68.9537
1 2020-01-01 00:05:00 0.0000
2 2020-01-01 00:10:00 0.0000
3 2020-01-01 00:15:00 0.0000
4 2020-01-01 00:20:00 0.0000
5 2020-07-01 00:00:00 0.0000
Solution using outer join of merge method and query:
print(df1)
timestamp Value <- changed Timestamp to timestamp to avoid name conflict in query
0 2020-01-13 00:00:00 -68.95370 <- also changed data for match
1 2020-01-01 00:05:00 -67.90175
2 2020-01-01 00:10:00 -67.45966
3 2020-01-01 00:15:00 -67.07624
4 2020-01-01 00:20:00 -67.30549
5 2020-07-01 00:00:00 -65.34212
df1.loc[df1.index.difference(df1.assign(key=0).merge(df2.assign(key=0), how = 'outer')\
.query("timestamp >= start_time and timestamp < end_time").index),"Value"] = 0
result:
timestamp Value
0 2020-01-13 00:00:00 -68.9537
1 2020-01-01 00:05:00 0.0000
2 2020-01-01 00:10:00 0.0000
3 2020-01-01 00:15:00 0.0000
4 2020-01-01 00:20:00 0.0000
5 2020-07-01 00:00:00 0.0000
Key assign(key=0) is added to both dataframes to produce cartesian product.

Pandas .resample() or .asfreq() fill forward times

I'm trying to resample a dataframe with a time series from 1-hour increments to 15-minute. Both .resample() and .asfreq() do almost exactly what I want, but I'm having a hard time filling the last three intervals.
I could add an extra hour at the end, resample, and then drop that last hour, but it feels hacky.
Current code:
df = pd.DataFrame({'date':pd.date_range('2018-01-01 00:00', '2018-01-01 01:00', freq = '1H'), 'num':5})
df = df.set_index('date').asfreq('15T', method = 'ffill', how = 'end').reset_index()
Current output:
date num
0 2018-01-01 00:00:00 5
1 2018-01-01 00:15:00 5
2 2018-01-01 00:30:00 5
3 2018-01-01 00:45:00 5
4 2018-01-01 01:00:00 5
Desired output:
date num
0 2018-01-01 00:00:00 5
1 2018-01-01 00:15:00 5
2 2018-01-01 00:30:00 5
3 2018-01-01 00:45:00 5
4 2018-01-01 01:00:00 5
5 2018-01-01 01:15:00 5
6 2018-01-01 01:30:00 5
7 2018-01-01 01:45:00 5
Thoughts?
Not sure about asfreq but reindex works wonderfully:
df.set_index('date').reindex(
pd.date_range(
df.date.min(),
df.date.max() + pd.Timedelta('1H'), freq='15T', closed='left'
),
method='ffill'
)
num
2018-01-01 00:00:00 5
2018-01-01 00:15:00 5
2018-01-01 00:30:00 5
2018-01-01 00:45:00 5
2018-01-01 01:00:00 5
2018-01-01 01:15:00 5
2018-01-01 01:30:00 5
2018-01-01 01:45:00 5

How resample dataframe

I have problem, when I resample dataframe index, the date change !!.
>>>dpvis=dpvi.Puissance.resample('10min').mean()
>>> dpvi.head()
Puissance
Date
2016-05-01 00:00:00 0
2016-05-01 00:05:00 0
2016-05-01 00:10:00 0
2016-05-01 00:15:00 0
2016-05-01 00:20:00 0
>>> dpvis.head()
Date
2015-06-14 00:00:00 0.0
2015-06-14 00:10:00 0.0
2015-06-14 00:20:00 0.0
2015-06-14 00:30:00 0.0
2015-06-14 00:40:00 0.0
Freq: 10T, Name: Puissance, dtype: float64
>>>
Here's a demonstration that resample() will work correctly with the data you've provided, assuming that your dtypes are correct. It's not exactly an answer to your problem, but it may serve as a sort of sanity check.
First, generate sample data for a two month period at 5min intervals:
import pandas as pd
Date = pd.date_range("2016-05-01", "2016-07-01", freq="5min", name='Date')
Puissance = {'Puissance': np.zeros(len(Date), dtype=int)}
df = pd.DataFrame(Puissance, index=Date)
df.head()
Puissance
Date
2016-05-01 00:00:00 0
2016-05-01 00:05:00 0
2016-05-01 00:10:00 0
2016-05-01 00:15:00 0
2016-05-01 00:20:00 0
df.shape # (17569, 1)
df.index.dtype # datetime64[ns]
df.Puissance.dtype # int64
Now resample to 10min intervals:
resampled = df.Puissance.resample('10min').mean()
resampled.shape # (8785,)
Note: df.resample('10min').mean() also gives the same results here.
resampled.head()
Date
2016-05-01 00:00:00 0
2016-05-01 00:10:00 0
2016-05-01 00:20:00 0
2016-05-01 00:30:00 0
2016-05-01 00:40:00 0
Freq: 10T, Name: Puissance, dtype: int64
resampled.tail()
Date
2016-06-30 23:20:00 0
2016-06-30 23:30:00 0
2016-06-30 23:40:00 0
2016-06-30 23:50:00 0
2016-07-01 00:00:00 0
Freq: 10T, Name: Puissance, dtype: int64
Resampling works as expected.
This suggests that there's an issue somewhere with your dtype declarations, or with the format of an observation that isn't shown in your head() output.
One clue may be that your Puissance values start out as integers (0), but are resampled as floats (0.0). If all of your Puissance values are zero-valued integers, the mean output dtype will also be int64, as seen above. (mean() will normally return dtype float64 if the values being averaged are not all the same.) Your example data may not be representative of the actual problem you're trying to solve - if so, consider updating your post with a more representative example.

Categories