Rounding df index along with its data to nearest interval - python

I have a time series dataframe with a DateTimeIndex, based on sensor data which sometimes arrives a bit early or a bit late. It looks something like this:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones(3), index=pd.DatetimeIndex([
    '2021-01-01 08:00', '2021-01-01 08:04', '2021-01-01 08:11']))
> df
2021-01-01 08:00:00 1.0
2021-01-01 08:04:00 1.0
2021-01-01 08:11:00 1.0
I'd like to rearrange it to match five-minute intervals without losing any data. I tried:
df.reindex(df.index.round('5 min'))
but it drops the data not matching the intervals:
2021-01-01 08:00:00 1.0
2021-01-01 08:05:00 NaN
2021-01-01 08:10:00 NaN
Is there a way to get this?
2021-01-01 08:00:00 1.0
2021-01-01 08:05:00 1.0
2021-01-01 08:10:00 1.0

I think you need method='nearest' in DataFrame.reindex:
df = df.reindex(df.index.round('5 min'), method='nearest')
print (df)
0
2021-01-01 08:00:00 1.0
2021-01-01 08:05:00 1.0
2021-01-01 08:10:00 1.0
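If you only need to snap the labels themselves, rather than look up the nearest row for each target, relabelling the index directly gives the same result here; a minimal sketch (not part of the original answer):
df_snapped = df.copy()
df_snapped.index = df_snapped.index.round('5 min')
reindex(..., method='nearest') is still useful when the target grid differs from the rounded index, and it also accepts a tolerance parameter to limit how far away a match may be.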

Related

Sum of dataframes : treating NaN as 0 when summed with other values, but returning NaN where all summed elements are NaN

I am trying to add some dataframes that contain NaN values. The data frames are indexed by time series, and in my case a NaN is meaningful: it means that a measurement wasn't done. So if all the data frames I'm adding have a NaN for a given timestamp, I need the result to have a NaN for that timestamp. But if one or more of them have a value for that timestamp, I need the sum of those values.
EDIT: Also, in my case, a 0 is different from a NaN: it means that there was a measurement and it measured 0 activity, unlike a NaN, which means that there was no measurement. So any solution using fillna(0) won't work.
I haven't found a proper way to do this yet. Here is an example of what I want to do:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'value': [0, 1, 1, 1, np.NaN, np.NaN, np.NaN]},
                   index=pd.date_range("01/01/2020 00:00", "01/01/2020 01:00", freq='10T'))
df2 = pd.DataFrame({'value': [0, 5, 5, 5, 5, 5, np.NaN]},
                   index=pd.date_range("01/01/2020 00:00", "01/01/2020 01:00", freq='10T'))
df1 + df2
What I get:
df1 + df2
value
2020-01-01 00:00:00 0.0
2020-01-01 00:10:00 6.0
2020-01-01 00:20:00 6.0
2020-01-01 00:30:00 6.0
2020-01-01 00:40:00 NaN
2020-01-01 00:50:00 NaN
2020-01-01 01:00:00 NaN
What I would want to have as a result :
value
2020-01-01 00:00:00 0.0
2020-01-01 00:10:00 6.0
2020-01-01 00:20:00 6.0
2020-01-01 00:30:00 6.0
2020-01-01 00:40:00 5.0
2020-01-01 00:50:00 5.0
2020-01-01 01:00:00 NaN
Does anybody know a clean way to do this?
Thank you.
(I'm using Python 3.9.1 and pandas 1.2.4)
You can use add with the fill_value=0 option. This will maintain the "all NaN" combinations as NaN:
df1.add(df2, fill_value=0)
output:
value
2020-01-01 00:00:00 0.0
2020-01-01 00:10:00 6.0
2020-01-01 00:20:00 6.0
2020-01-01 00:30:00 6.0
2020-01-01 00:40:00 5.0
2020-01-01 00:50:00 5.0
2020-01-01 01:00:00 NaN
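The same idea extends to more than two frames: folding add(..., fill_value=0) over a list still leaves a timestamp as NaN only where every frame is NaN. A small sketch, assuming all the frames share the same index and columns:
from functools import reduce

dfs = [df1, df2]  # any number of frames
total = reduce(lambda a, b: a.add(b, fill_value=0), dfs)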

Is there a way how to interpolate on pandas Resampler object directly?

I have a DataFrame with an irregular sampling frequency, so I would like to resample it and interpolate.
Let's say I have the following data:
import pandas as pd
idx = pd.DatetimeIndex(["2021-01-01 00:01:35", "2021-01-01 00:05:01", "2021-01-01 00:08:42"])
df = pd.DataFrame({"a": [1, 2, 3]}, index=idx)
# a
# 2021-01-01 00:01:35 1
# 2021-01-01 00:05:01 2
# 2021-01-01 00:08:42 3
And I would like to get a result similar to this one (interpolation using the "index" method):
a
2021-01-01 00:02:00 1.121359
2021-01-01 00:03:00 1.412621
2021-01-01 00:04:00 1.703883
2021-01-01 00:05:00 1.995146
2021-01-01 00:06:00 2.266968
2021-01-01 00:07:00 2.538462
2021-01-01 00:08:00 2.809955
For that, I thought that something like df.resample("T").interpolate(method="index") could work, but it does not; I would need to insert an aggregation function first, e.g. df.resample("T").mean().interpolate(method="index"), but that does not produce the wanted result.
I could do some workaround like this:
df_res = pd.concat([df, df.resample("T").asfreq()]).sort_index()
df_res = df_res[~df_res.index.duplicated()]
df_res = df_res.interpolate(method="index").dropna()
df_res
# a
# 2021-01-01 00:01:35 1.000000
# 2021-01-01 00:02:00 1.121359
# 2021-01-01 00:03:00 1.412621
# 2021-01-01 00:04:00 1.703883
# 2021-01-01 00:05:00 1.995146
# 2021-01-01 00:05:01 2.000000
# 2021-01-01 00:06:00 2.266968
# 2021-01-01 00:07:00 2.538462
# 2021-01-01 00:08:00 2.809955
# 2021-01-01 00:08:42 3.000000
And then I can remove the original 3 timestamps or keep everything, depending on my preference. But I'm wondering whether there is a better solution that works directly by combining the resample and interpolate methods?
There may be other ways to do this, but since the original data is at second resolution, upsampling to seconds is the natural choice. The Resampler object has an interpolate method, so we can use it directly. This gives a DataFrame filled in at one-second intervals, which we then filter down to whole minutes.
df.resample('S').interpolate().head()
a
2021-01-01 00:01:35 1.000000
2021-01-01 00:01:36 1.004854
2021-01-01 00:01:37 1.009709
2021-01-01 00:01:38 1.014563
2021-01-01 00:01:39 1.019417
Then filter to whole minutes with query:
df.resample('S').interpolate().query('index.dt.second == 0')
a
2021-01-01 00:02:00 1.121359
2021-01-01 00:03:00 1.412621
2021-01-01 00:04:00 1.703883
2021-01-01 00:05:00 1.995146
2021-01-01 00:06:00 2.266968
2021-01-01 00:07:00 2.538462
2021-01-01 00:08:00 2.809955

Mean between two datetimes; if NaN, get last non-NaN value

Yesterday I asked this question (with some good answers) which is very similar, but slightly different from the problem I'm presented with now. Say I have the following pd.DataFrame (dict):
eff_timestamp val id begin_timestamp end_timestamp
0 2021-01-01 00:00:00 -0.710230 1 2021-01-01 02:00:00 2021-01-01 05:30:00
1 2021-01-01 01:00:00 0.121464 1 2021-01-01 02:00:00 2021-01-01 05:30:00
2 2021-01-01 02:00:00 -0.156328 1 2021-01-01 02:00:00 2021-01-01 05:30:00
3 2021-01-01 03:00:00 0.788685 1 2021-01-01 02:00:00 2021-01-01 05:30:00
4 2021-01-01 04:00:00 0.505210 1 2021-01-01 02:00:00 2021-01-01 05:30:00
5 2021-01-01 05:00:00 -0.738344 1 2021-01-01 02:00:00 2021-01-01 05:30:00
6 2021-01-01 06:00:00 0.266910 1 2021-01-01 02:00:00 2021-01-01 05:30:00
7 2021-01-01 07:00:00 -0.587401 1 2021-01-01 02:00:00 2021-01-01 05:30:00
8 2021-01-02 00:00:00 -0.160692 2 2021-01-02 12:00:00 2021-01-02 15:30:00
9 2021-01-02 01:00:00 0.306354 2 2021-01-02 12:00:00 2021-01-02 15:30:00
10 2021-01-02 02:00:00 NaN 2 2021-01-02 12:00:00 2021-01-02 15:30:00
11 2021-01-02 03:00:00 NaN 2 2021-01-02 12:00:00 2021-01-02 15:30:00
12 2021-01-02 04:00:00 NaN 2 2021-01-02 12:00:00 2021-01-02 15:30:00
13 2021-01-02 05:00:00 NaN 2 2021-01-02 12:00:00 2021-01-02 15:30:00
14 2021-01-02 06:00:00 NaN 2 2021-01-02 12:00:00 2021-01-02 15:30:00
15 2021-01-02 07:00:00 -0.349705 2 2021-01-02 12:00:00 2021-01-02 15:30:00
I would like to get the mean value of val for each unique id, for those val's that lie between begin_timestamp and end_timestamp. If there are no rows that satisfy that criterion, I'd like to get the last value for that id before that period. Note that in this example, id=2 has no rows that satisfy the criterion. Previously I could slice the data so I only keep the rows between begin_timestamp and end_timestamp, and then use a groupby. The solution from my previous post then replaces the NaN value in the groupby object. However, in the example above, id=2 has no rows at all that satisfy the criterion, and therefore there is no NaN value created that can be replaced. So if I slice the data based on the criterion above:
sliced = df[(df.eff_timestamp > df.begin_timestamp) & (df.eff_timestamp < df.end_timestamp)]
sliced
>>>
eff_timestamp val id begin_timestamp end_timestamp
3 2021-01-01 03:00:00 0.788685 1 2021-01-01 02:00:00 2021-01-01 05:30:00
4 2021-01-01 04:00:00 0.505210 1 2021-01-01 02:00:00 2021-01-01 05:30:00
5 2021-01-01 05:00:00 -0.738344 1 2021-01-01 02:00:00 2021-01-01 05:30:00
sliced.groupby('id').val.mean()
>>>
id
1 0.185184
Name: val, dtype: float64
This result only includes id=1 with the mean value, but there is no value for id=2. How would I, instead of the mean, include the last available value for id=2, which is -0.349705?
Create a temporary column between_time, then group by the id column. In apply, add the condition: if, for a particular id, any value lies within the range, take the mean; otherwise take the value at last_valid_index.
result = (
    df.assign(
        between_time=(df.eff_timestamp > df.begin_timestamp) & (df.eff_timestamp < df.end_timestamp))
    .groupby('id')
    .apply(
        lambda x: x.loc[x['between_time']]['val'].mean()
        if any(x['between_time'].values)
        else x.loc[x['val'].last_valid_index()]['val']
    )
)
OUTPUT:
id
1 0.185184
2 -0.349705
dtype: float64
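An alternative sketch without apply: compute the in-window means first, then fall back per id to the last non-NaN val (the variable names below are mine, not from the answer):
in_window = df[(df.eff_timestamp > df.begin_timestamp) & (df.eff_timestamp < df.end_timestamp)]
means = in_window.groupby('id').val.mean()
fallback = df.groupby('id').val.last()  # last() returns the last non-null value per group
result = means.reindex(fallback.index).fillna(fallback)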

How to impute missing value in time series data with the value of the same day and time from the previous week(day) in python

I have a dataframe with columns of timestamp and energy usage. The timestamp is taken for every minute of the day, i.e., a total of 1440 readings for each day. I have a few missing values in the data frame.
I want to impute those missing values with the mean of the values at the same day and time over the last two or three weeks. This way, if the previous week is also missing, I can use the value from two weeks ago.
Here's an example of the data:
mains_1
timestamp
2013-01-03 00:00:00 155.00
2013-01-03 00:01:00 154.00
2013-01-03 00:02:00 NaN
2013-01-03 00:03:00 154.00
2013-01-03 00:04:00 153.00
... ...
2013-04-30 23:55:00 NaN
2013-04-30 23:56:00 182.00
2013-04-30 23:57:00 181.00
2013-04-30 23:58:00 182.00
2013-04-30 23:59:00 182.00
Right now I have this line of code:
df['mains_1'] = (df
    .groupby((df.index.dayofweek * 24) + (df.index.hour) + (df.index.minute / 60))
    .transform(lambda x: x.fillna(x.mean()))
)
What this does is use the average usage from the same time of day over the whole dataset. I want it to be more precise and use the average of the last two or three weeks only.
You can concat together the Series with shift in a loop; the index alignment ensures each column matches the same hour in a previous week. Then take the mean across columns and use .fillna to update the original.
Sample Data
import pandas as pd
import numpy as np
np.random.seed(5)
df = pd.DataFrame(index=pd.date_range('2010-01-01 10:00:00', freq='W', periods=10),
                  data=np.random.choice([1, 2, 3, 4, np.NaN], 10),
                  columns=['mains_1'])
# mains_1
#2010-01-03 10:00:00 4.0
#2010-01-10 10:00:00 1.0
#2010-01-17 10:00:00 2.0
#2010-01-24 10:00:00 1.0
#2010-01-31 10:00:00 NaN
#2010-02-07 10:00:00 4.0
#2010-02-14 10:00:00 1.0
#2010-02-21 10:00:00 1.0
#2010-02-28 10:00:00 NaN
#2010-03-07 10:00:00 2.0
Code
# range(4): the current week plus the previous 3 weeks.
df1 = pd.concat([df.shift(periods=x, freq='W') for x in range(4)], axis=1)
# mains_1 mains_1 mains_1 mains_1
#2010-01-03 10:00:00 4.0 NaN NaN NaN
#2010-01-10 10:00:00 1.0 4.0 NaN NaN
#2010-01-17 10:00:00 2.0 1.0 4.0 NaN
#2010-01-24 10:00:00 1.0 2.0 1.0 4.0
#2010-01-31 10:00:00 NaN 1.0 2.0 1.0
#2010-02-07 10:00:00 4.0 NaN 1.0 2.0
#2010-02-14 10:00:00 1.0 4.0 NaN 1.0
#2010-02-21 10:00:00 1.0 1.0 4.0 NaN
#2010-02-28 10:00:00 NaN 1.0 1.0 4.0
#2010-03-07 10:00:00 2.0 NaN 1.0 1.0
#2010-03-14 10:00:00 NaN 2.0 NaN 1.0
#2010-03-21 10:00:00 NaN NaN 2.0 NaN
#2010-03-28 10:00:00 NaN NaN NaN 2.0
df['mains_1'] = df['mains_1'].fillna(df1.mean(axis=1))
print(df)
mains_1
2010-01-03 10:00:00 4.000000
2010-01-10 10:00:00 1.000000
2010-01-17 10:00:00 2.000000
2010-01-24 10:00:00 1.000000
2010-01-31 10:00:00 1.333333
2010-02-07 10:00:00 4.000000
2010-02-14 10:00:00 1.000000
2010-02-21 10:00:00 1.000000
2010-02-28 10:00:00 2.000000
2010-03-07 10:00:00 2.000000
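The same pattern can be wrapped in a small helper; a hedged sketch (the function name is mine) that shifts by exact multiples of seven days, so it also works on indexes that are not aligned to a weekly anchor:
def fill_from_prior_weeks(s, weeks=3):
    # For each NaN, take the mean of the values at the same timestamp in the
    # previous `weeks` weeks (index shifted forward by 7, 14, ... days).
    prior = pd.concat([s.shift(freq=f'{7 * w}D') for w in range(1, weeks + 1)], axis=1)
    return s.fillna(prior.mean(axis=1))

# e.g. df['mains_1'] = fill_from_prior_weeks(df['mains_1'], weeks=3)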

Incomplete filling when upsampling with `agg` for multiple columns (pandas resample)

I found this behavior of resample to be confusing after working on a related question. Here are some time series data at 5 minute intervals but with missing rows (code to construct at end):
user value total
2020-01-01 09:00:00 fred 1 1
2020-01-01 09:05:00 fred 13 1
2020-01-01 09:15:00 fred 27 3
2020-01-01 09:30:00 fred 40 12
2020-01-01 09:35:00 fred 15 12
2020-01-01 10:00:00 fred 19 16
I want to fill in the missing times, using a different method for each column. For user and total I want to do a forward fill, while for value I want to fill in with zeroes.
One approach I found was to resample, and then fill in the missing data after the fact:
resampled = df.resample('5T').asfreq()
resampled['user'].ffill(inplace=True)
resampled['total'].ffill(inplace=True)
resampled['value'].fillna(0, inplace=True)
Which gives correct expected output:
user value total
2020-01-01 09:00:00 fred 1.0 1.0
2020-01-01 09:05:00 fred 13.0 1.0
2020-01-01 09:10:00 fred 0.0 1.0
2020-01-01 09:15:00 fred 27.0 3.0
2020-01-01 09:20:00 fred 0.0 3.0
2020-01-01 09:25:00 fred 0.0 3.0
2020-01-01 09:30:00 fred 40.0 12.0
2020-01-01 09:35:00 fred 15.0 12.0
2020-01-01 09:40:00 fred 0.0 12.0
2020-01-01 09:45:00 fred 0.0 12.0
2020-01-01 09:50:00 fred 0.0 12.0
2020-01-01 09:55:00 fred 0.0 12.0
2020-01-01 10:00:00 fred 19.0 16.0
I thought one would be able to use agg to specify what to do by column, so I tried the following:
resampled = df.resample('5T').agg({'user': 'ffill',
                                   'value': 'sum',
                                   'total': 'ffill'})
I find this clearer and simpler, but it doesn't give the expected output. The sum works, but the forward fill does not:
user value total
2020-01-01 09:00:00 fred 1 1.0
2020-01-01 09:05:00 fred 13 1.0
2020-01-01 09:10:00 NaN 0 NaN
2020-01-01 09:15:00 fred 27 3.0
2020-01-01 09:20:00 NaN 0 NaN
2020-01-01 09:25:00 NaN 0 NaN
2020-01-01 09:30:00 fred 40 12.0
2020-01-01 09:35:00 fred 15 12.0
2020-01-01 09:40:00 NaN 0 NaN
2020-01-01 09:45:00 NaN 0 NaN
2020-01-01 09:50:00 NaN 0 NaN
2020-01-01 09:55:00 NaN 0 NaN
2020-01-01 10:00:00 fred 19 16.0
Can someone explain this output, and whether there is a way to achieve the expected output using agg? It seems odd that the forward fill doesn't work here, since resampled = df.resample('5T').ffill() would forward-fill every column (undesired here, as it would do so for the value column as well). The closest I have come is to run the resampling individually for each column and apply the function I want:
resampled = pd.DataFrame()
d = {'user': 'ffill',
     'value': 'sum',
     'total': 'ffill'}
for k, v in d.items():
    resampled[k] = df[k].resample('5T').apply(v)
This works, but feels silly given that it adds extra iteration and uses the very dictionary I am trying to pass to agg! I have looked at a few posts on agg and apply but can't seem to explain what is happening here:
Losing String column when using resample and aggregation with pandas
resample multiple columns with pandas
pandas groupby with agg not working on multiple columns
Pandas named aggregation not working with resample agg
I have also tried using groupby with a pd.Grouper and using the pd.NamedAgg class, with no luck.
Example data:
import pandas as pd

dates = ['01-01-2020 9:00', '01-01-2020 9:05', '01-01-2020 9:15',
         '01-01-2020 9:30', '01-01-2020 9:35', '01-01-2020 10:00']
dates = pd.to_datetime(dates)
df = pd.DataFrame({'user': ['fred'] * len(dates),
                   'value': [1, 13, 27, 40, 15, 19],
                   'total': [1, 1, 3, 12, 12, 16]},
                  index=dates)
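One possible workaround sketch (this does not explain the agg behaviour, it only reaches the expected output): let agg handle the sum so empty bins get 0 for value, then forward-fill the two columns that came back NaN:
resampled = df.resample('5T').agg({'user': 'ffill',
                                   'value': 'sum',
                                   'total': 'ffill'})
resampled[['user', 'total']] = resampled[['user', 'total']].ffill()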
