I have a DataFrame with an irregular sampling frequency, so I would like to resample it and interpolate.
Let's say I have the following data:
import pandas as pd
idx = pd.DatetimeIndex(["2021-01-01 00:01:35", "2021-01-01 00:05:01", "2021-01-01 00:08:42"])
df = pd.DataFrame({"a": [1, 2, 3]}, index=idx)
# a
# 2021-01-01 00:01:35 1
# 2021-01-01 00:05:01 2
# 2021-01-01 00:08:42 3
And I would like to get a result similar to this one (interpolation using the "index" method):
a
2021-01-01 00:02:00 1.121359
2021-01-01 00:03:00 1.412621
2021-01-01 00:04:00 1.703883
2021-01-01 00:05:00 1.995146
2021-01-01 00:06:00 2.266968
2021-01-01 00:07:00 2.538462
2021-01-01 00:08:00 2.809955
For that, I thought something like df.resample("T").interpolate(method="index") could work, but it does not; resample requires an aggregation step, e.g. df.resample("T").mean().interpolate(method="index"). That, however, does not give the wanted result either, because mean() snaps everything onto the minute grid first, so the original timestamps are gone before the interpolation runs.
I could do some workaround like this:
df_res = pd.concat([df, df.resample("T").asfreq()]).sort_index()
df_res = df_res[~df_res.index.duplicated()]
df_res = df_res.interpolate(method="index").dropna()
df_res
# a
# 2021-01-01 00:01:35 1.000000
# 2021-01-01 00:02:00 1.121359
# 2021-01-01 00:03:00 1.412621
# 2021-01-01 00:04:00 1.703883
# 2021-01-01 00:05:00 1.995146
# 2021-01-01 00:05:01 2.000000
# 2021-01-01 00:06:00 2.266968
# 2021-01-01 00:07:00 2.538462
# 2021-01-01 00:08:00 2.809955
# 2021-01-01 00:08:42 3.000000
And then I could remove the original three index entries, or keep everything, depending on my preference. But I'm wondering whether there is a better solution that works directly by combining the resample and interpolate methods?
There may be other ways to do this, but since the base resolution of the original data is seconds, upsampling to seconds is the way to go. The resampler has an interpolate method, so we will use that. This produces a filled-in data frame at one-second intervals, which we can then filter down to whole minutes.
df.resample('S').interpolate().head()
a
2021-01-01 00:01:35 1.000000
2021-01-01 00:01:36 1.004854
2021-01-01 00:01:37 1.009709
2021-01-01 00:01:38 1.014563
2021-01-01 00:01:39 1.019417
Then filter out the whole-minute rows with query:
df.resample('S').interpolate().query('index.dt.second == 0')
a
2021-01-01 00:02:00 1.121359
2021-01-01 00:03:00 1.412621
2021-01-01 00:04:00 1.703883
2021-01-01 00:05:00 1.995146
2021-01-01 00:06:00 2.266968
2021-01-01 00:07:00 2.538462
2021-01-01 00:08:00 2.809955
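If you would rather not use query, an alternative sketch is to chain a second resample onto the upsampled frame; the minute mark before the first sample has no data and comes back NaN, hence the dropna():
df.resample('S').interpolate().resample('T').asfreq().dropna()
This gives the same minute grid as the query above.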
I have a dataframe with a column 'queue_ist_dt'. This column contains pandas._libs.tslibs.timestamps.Timestamp values. My requirement is:
if time = 10:13:00 then round_off_time = 10:00:00
if time = 23:29:00 then round_off_time = 23:00:00
and so on.
if time = 10:31:00 then round_off_time = 10:30:00
if time = 23:53:00 then round_off_time = 23:30:00
and so on.
if time = 10:30:00 then round_off_time = 10:30:00
These are the 3 conditions.
I tried to write the following logic :
for r in range(df.shape[0]):
    try:
        if df.loc[r,'queue_ist_dt'].minute<30:
            timedelta = pd.Timedelta(minutes=df.loc[r,'queue_ist_dt'].minute)
            df.loc[r,'queue_placed_interval'] = df.loc[r,'queue_ist_dt']- timedelta
        elif df.loc[r,'queue_ist_dt'].minute>30:
            ******NEED HELP TO BUILD THIS LOGIC******
    except:
        pass
Need help to build logic for the time where minutes is greater than 30 mins and have to be rounded down to 30 mins.
Use Series.dt.floor:
#if necessary convert to datetimes
df['queue_ist_dt'] = pd.to_datetime(df['queue_ist_dt'].astype(str))
df['queue_ist_dt1'] = df['queue_ist_dt'].dt.floor('30Min').dt.time
print (df)
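As a quick check, here is a self-contained sketch of what that floor does to the example times from the question (data made up to match the stated conditions):
import pandas as pd

s = pd.Series(pd.to_datetime(["2021-01-01 10:13:00",
                              "2021-01-01 23:29:00",
                              "2021-01-01 10:31:00",
                              "2021-01-01 23:53:00",
                              "2021-01-01 10:30:00"]))
print(s.dt.floor('30Min').dt.time)
# 0    10:00:00
# 1    23:00:00
# 2    10:30:00
# 3    23:30:00
# 4    10:30:00
# dtype: object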
The logic is to subtract the minutes beyond 30 when the minute part is past the half hour. The code is as below:
for r in range(df.shape[0]):
    try:
        if df.loc[r,'queue_ist_dt'].minute < 30:
            # round down to the top of the hour
            timedelta = pd.Timedelta(minutes=df.loc[r,'queue_ist_dt'].minute)
            df.loc[r,'queue_placed_interval'] = df.loc[r,'queue_ist_dt'] - timedelta
        elif df.loc[r,'queue_ist_dt'].minute > 30:
            # round down to half past the hour: subtract the minutes beyond 30
            timedelta = pd.Timedelta(minutes=df.loc[r,'queue_ist_dt'].minute - 30)
            df.loc[r,'queue_placed_interval'] = df.loc[r,'queue_ist_dt'] - timedelta
    except:
        pass
Let me know if this helps you😊
Considering this dataframe df as an example
df = pd.DataFrame({'queue_ist_dt': [pd.Timestamp('2021-01-01 10:01:00'),
                                    pd.Timestamp('2021-01-01 10:35:00'),
                                    pd.Timestamp('2021-01-01 11:19:00'),
                                    pd.Timestamp('2021-01-01 11:33:00'),
                                    pd.Timestamp('2021-01-01 23:23:00'),
                                    pd.Timestamp('2021-01-01 23:22:00'),
                                    pd.Timestamp('2021-01-01 23:55:00')]
                   })
[Out]:
queue_ist_dt
0 2021-01-01 10:01:00
1 2021-01-01 10:35:00
2 2021-01-01 11:19:00
3 2021-01-01 11:33:00
4 2021-01-01 23:23:00
5 2021-01-01 23:22:00
6 2021-01-01 23:55:00
One way would be to use pandas.Series.dt.round as follows. Note that round goes to the nearest half hour in either direction; for the strict round-down rule in the question, see the dt.floor note after the examples.
df['round_off_time'] = df['queue_ist_dt'].dt.round('30min')
[Out]:
queue_ist_dt round_off_time
0 2021-01-01 10:01:00 2021-01-01 10:00:00
1 2021-01-01 10:35:00 2021-01-01 10:30:00
2 2021-01-01 11:19:00 2021-01-01 11:30:00
3 2021-01-01 11:33:00 2021-01-01 11:30:00
4 2021-01-01 23:23:00 2021-01-01 23:30:00
5 2021-01-01 23:22:00 2021-01-01 23:30:00
6 2021-01-01 23:55:00 2021-01-02 00:00:00
If the goal is to change the values in the column queue_ist_dt, do the following
df['queue_ist_dt'] = df['queue_ist_dt'].dt.round('30min')
[Out]:
queue_ist_dt
0 2021-01-01 10:00:00
1 2021-01-01 10:30:00
2 2021-01-01 11:30:00
3 2021-01-01 11:30:00
4 2021-01-01 23:30:00
5 2021-01-01 23:30:00
6 2021-01-02 00:00:00
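As noted above, dt.round rounds to the nearest half hour in either direction (e.g. 11:19:00 became 11:30:00). For the strict round-down rule in the question, dt.floor is the drop-in replacement:
df['round_off_time'] = df['queue_ist_dt'].dt.floor('30min')
# 11:19:00 -> 11:00:00, 23:55:00 -> 23:30:00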
I want to groupby and resample a dataframe I have. I group by int_var and bool_var, and then I resample per 1Min to fill in any missing minutes in the dataset. This works perfectly fine for the base dataframe A:
date bool_var int_var
2021-01-01 00:03:00 True 1
2021-01-01 00:06:00 False 6
2021-01-01 00:06:00 True 6
The result then becomes something like this:
int_var bool_var date
1 True 2021-01-01 00:03:00 1
2021-01-01 00:04:00 0
2021-01-01 00:05:00 0
2021-01-01 00:06:00 0
6 True 2021-01-01 00:03:00 0
2021-01-01 00:04:00 0
2021-01-01 00:05:00 0
2021-01-01 00:06:00 1
6 False 2021-01-01 00:03:00 0
2021-01-01 00:04:00 0
2021-01-01 00:05:00 0
2021-01-01 00:06:00 1
This is exactly what I want. However, as you can see, the data starts a bit after midnight, and I want those minutes from midnight to be in there as well. So I append a row for each bool_var / int_var combination at 2021-01-01 00:00:00, to make sure the resampling starts from there:
rows = []
# one midnight row per unique (int_var, bool_var) combination
# (the original loop was elided in the question; this is a sketch)
for int_var, bool_var in A[['int_var', 'bool_var']].drop_duplicates().itertuples(index=False):
    rows.append([pd.Timestamp('2021-01-01 00:00:00'), bool_var, int_var])
extra_rows_df = pd.DataFrame(rows, columns=['date', 'bool_var', 'int_var'])
B = pd.concat([A, extra_rows_df], ignore_index=True)
The resulting dataframe B appears to be correct, and in the same format as dataframe A:
date bool_var int_var
2021-01-01 00:00:00 True 1
2021-01-01 00:03:00 True 1
2021-01-01 00:00:00 False 6
2021-01-01 00:06:00 False 6
2021-01-01 00:00:00 True 6
2021-01-01 00:06:00 True 6
However, if I run the exact same groupby and resample command on dataframe B, my results are all weird:
date 2021-01-01 00:00:00 ... 2021-12-31 23:59:00
int_var bool_var 1 ... 1
1 True
6 True
False
It is like each date suddenly became a column instead of being listed for each grouping.
TL;DR: use stack().
I figured it out. In dataframe A, every bool_var / int_var group covered different datetime values: here (1, True) started at 00:03, but some other group, e.g. (2, True), could start with an entry at 01:14. Once I padded dataframe A so that each group had an entry at 00:00 (dataframe B) and resampled to fill in each minute, every group contained exactly the same datetimes. Because those datetimes now apply to every group, pandas turned them into columns.
The solution is to use stack() on this final result to get back the long format.
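The original groupby/resample call isn't shown above, so the following is only a self-contained sketch of the pattern, assuming the counts come from a groupby().apply() of a per-group resample:
import pandas as pd

B = pd.DataFrame({
    'date': pd.to_datetime(['2021-01-01 00:00:00', '2021-01-01 00:03:00',
                            '2021-01-01 00:06:00', '2021-01-01 00:00:00',
                            '2021-01-01 00:06:00', '2021-01-01 00:00:00',
                            '2021-01-01 00:06:00']),
    'bool_var': [True, True, True, False, False, True, True],
    'int_var': [1, 1, 1, 6, 6, 6, 6],
})

wide = (B.set_index('date')
         .groupby(['int_var', 'bool_var'])['int_var']
         .apply(lambda g: g.resample('1Min').count()))

# Every group now spans the same minutes, so apply() returns the dates as
# columns (a wide frame); stack() moves them back into the row index.
long = wide.stack()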
The following works for getting the unique differences between consecutive values of a datetime index.
# Data
import pandas
d = pandas.DataFrame({"a": [x for x in range(5)]})
d.index = pandas.date_range("2021-01-01 00:00:00", "2021-01-01 01:00:00", freq="15min")
# Get difference
delta = d.index.to_series().diff().astype("timedelta64[m]").unique()
delta
# array([nan, 15.])
But I am not clear where the nan comes from. I am only interested in the 15 minutes. Is delta[1] a reliable way to get it or am I missing something?
The first row doesn't have anything to diff against, so it's NaT.
>>> d.index.to_series().diff()
2021-01-01 00:00:00 NaT
2021-01-01 00:15:00 00:15:00
2021-01-01 00:30:00 00:15:00
2021-01-01 00:45:00 00:15:00
2021-01-01 01:00:00 00:15:00
Freq: 15T, dtype: timedelta64[ns]
From pandas.Series.unique: "Uniques are returned in order of appearance." Since that NaT is guaranteed to be the first element in the returned array, it is okay to do delta[1] as you suggest, assuming you have at least two rows and no NaT values in the data.
More generally, if you don't want that first value in a diff, you can slice it off:
>>> d.index.to_series().diff()[1:]
2021-01-01 00:15:00 00:15:00
2021-01-01 00:30:00 00:15:00
2021-01-01 00:45:00 00:15:00
2021-01-01 01:00:00 00:15:00
Freq: 15T, dtype: timedelta64[ns]
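An equivalent way to drop the leading NaT without relying on its position is dropna():
delta = d.index.to_series().diff().dropna().unique()
# only the 15-minute timedelta remains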
When you do diff, the first item comes back as NaT in pandas, which is not the same behaviour as in R:
d.index.to_series().diff()
Out[713]:
2021-01-01 00:00:00 NaT
2021-01-01 00:15:00 0 days 00:15:00
2021-01-01 00:30:00 0 days 00:15:00
2021-01-01 00:45:00 0 days 00:15:00
2021-01-01 01:00:00 0 days 00:15:00
Freq: 15T, dtype: timedelta64[ns]
Yesterday I asked this question (with some good answers), which is very similar to, but slightly different from, the problem I'm presented with now. Say I have the following pd.DataFrame (dict):
eff_timestamp val id begin_timestamp end_timestamp
0 2021-01-01 00:00:00 -0.710230 1 2021-01-01 02:00:00 2021-01-01 05:30:00
1 2021-01-01 01:00:00 0.121464 1 2021-01-01 02:00:00 2021-01-01 05:30:00
2 2021-01-01 02:00:00 -0.156328 1 2021-01-01 02:00:00 2021-01-01 05:30:00
3 2021-01-01 03:00:00 0.788685 1 2021-01-01 02:00:00 2021-01-01 05:30:00
4 2021-01-01 04:00:00 0.505210 1 2021-01-01 02:00:00 2021-01-01 05:30:00
5 2021-01-01 05:00:00 -0.738344 1 2021-01-01 02:00:00 2021-01-01 05:30:00
6 2021-01-01 06:00:00 0.266910 1 2021-01-01 02:00:00 2021-01-01 05:30:00
7 2021-01-01 07:00:00 -0.587401 1 2021-01-01 02:00:00 2021-01-01 05:30:00
8 2021-01-02 00:00:00 -0.160692 2 2021-01-02 12:00:00 2021-01-02 15:30:00
9 2021-01-02 01:00:00 0.306354 2 2021-01-02 12:00:00 2021-01-02 15:30:00
10 2021-01-02 02:00:00 NaN 2 2021-01-02 12:00:00 2021-01-02 15:30:00
11 2021-01-02 03:00:00 NaN 2 2021-01-02 12:00:00 2021-01-02 15:30:00
12 2021-01-02 04:00:00 NaN 2 2021-01-02 12:00:00 2021-01-02 15:30:00
13 2021-01-02 05:00:00 NaN 2 2021-01-02 12:00:00 2021-01-02 15:30:00
14 2021-01-02 06:00:00 NaN 2 2021-01-02 12:00:00 2021-01-02 15:30:00
15 2021-01-02 07:00:00 -0.349705 2 2021-01-02 12:00:00 2021-01-02 15:30:00
I would like to get the mean value of val for each unique id, for those val's that lie between the begin_timestamp and end_timestamp. If there are no rows that satisfy that criteria, I'd like to get the last value for that id before that period. Note that in this example, id=2 has no rows that satisfy the criteria. Previously I could slice the data so I only keep the rows between the begin_timestamp and end_timestamp, and then use a groupby. The solution from my previous post then replaces the NaN value in the groupby object. However, in the example above, id=2 has no rows at all that satisfy the criteria, and therefore there is no NaN value created that can be replaced. So if I slice the data based on the criteria above:
sliced = df[(df.eff_timestamp > df.begin_timestamp) & (df.eff_timestamp < df.end_timestamp)]
sliced
>>>
eff_timestamp val id begin_timestamp end_timestamp
3 2021-01-01 03:00:00 0.788685 1 2021-01-01 02:00:00 2021-01-01 05:30:00
4 2021-01-01 04:00:00 0.505210 1 2021-01-01 02:00:00 2021-01-01 05:30:00
5 2021-01-01 05:00:00 -0.738344 1 2021-01-01 02:00:00 2021-01-01 05:30:00
sliced.groupby('id').val.mean()
>>>
id
1 0.185184
Name: val, dtype: float64
This result only includes id=1 with the mean value, but there is no value for id=2. How would I, instead of the mean, include the last available value for id=2, which is -0.349705?
Create a temporary column between_time. Then group by the id column and, inside apply, add the condition: if for a particular id any value lies within the range, take the mean; otherwise take the value at the last_valid_index.
result = (
    df.assign(
        between_time=(df.eff_timestamp > df.begin_timestamp) & (df.eff_timestamp < df.end_timestamp))
    .groupby('id')
    .apply(
        lambda x: x.loc[x['between_time']]['val'].mean()
        if any(x['between_time'].values)
        else x.loc[x['val'].last_valid_index()]['val']
    )
)
OUTPUT:
id
1 0.185184
2 -0.349705
dtype: float64
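For reference, a rough non-apply alternative under the same assumptions (rows sorted by eff_timestamp, so that groupby(...).last() picks up the latest non-null val per id) could combine the sliced means with a fallback:
means = sliced.groupby('id').val.mean()
last_vals = df.groupby('id').val.last()   # last non-null val per id
result = means.reindex(last_vals.index).fillna(last_vals)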
I have a time series dataframe with a DateTimeIndex, based on sensor data which sometimes arrives a bit early or a bit late. It looks something like this:
df = pd.DataFrame(np.ones(3), index=pd.DatetimeIndex([
'2021-01-01 08:00', '2021-01-01 08:04', '2021-01-01 08:11']))
> df
2021-01-01 08:00:00 1.0
2021-01-01 08:04:00 1.0
2021-01-01 08:11:00 1.0
I'd like to rearrange it to match five-minute intervals without losing any data. I tried:
df.reindex(df.index.round('5 min'))
but it drops the data not matching the intervals:
2021-01-01 08:00:00 1.0
2021-01-01 08:05:00 NaN
2021-01-01 08:10:00 NaN
Is there a way to get this?
2021-01-01 08:00:00 1.0
2021-01-01 08:05:00 1.0
2021-01-01 08:10:00 1.0
I think you need method='nearest' in DataFrame.reindex:
df = df.reindex(df.index.round('5 min'), method='nearest')
print (df)
0
2021-01-01 08:00:00 1.0
2021-01-01 08:05:00 1.0
2021-01-01 08:10:00 1.0
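If you also want to avoid matching samples that are too far from a grid point, reindex takes a tolerance; grid points with no sample inside the window come back as NaN. A sketch:
df = df.reindex(df.index.round('5 min'), method='nearest',
                tolerance=pd.Timedelta(minutes=2, seconds=30))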