I'm trying to use pandas to resample 15-minute periods into 1-hour periods by applying a custom function. My DataFrame is in this format:
Date val1 val2
2016-01-30 07:00:00 49.0 45.0
2016-01-30 07:15:00 49.0 44.0
2016-01-30 07:30:00 52.0 47.0
2016-01-30 07:45:00 60.0 46.0
2016-01-30 08:00:00 63.0 61.0
2016-01-30 08:15:00 61.0 60.0
2016-01-30 08:30:00 62.0 61.0
2016-01-30 08:45:00 63.0 61.0
2016-01-30 09:00:00 68.0 60.0
2016-01-30 09:15:00 71.0 70.0
2016-01-30 09:30:00 71.0 70.0
...and I want to resample with this function:
import math

def log_add(array_like):
    return 10 * math.log10(sum(10 ** (i / 10) for i in array_like))
I do:
df.resample('1H').apply(log_add)
but this returns an empty df. Doing this:
df.resample('1H').apply(lambda x: log_add(x))
does the same. Does anyone have any ideas why it's not applying the function properly?
Any help would be appreciated, thanks.
You can use the on parameter, which was added in pandas 0.19.0:
print (df.resample('1H', on='Date').apply(log_add))
Or set Date as the index with set_index:
df.set_index('Date', inplace=True)
print (df.resample('1H').apply(log_add))
Also, first check whether the dtype of the Date column is datetime; if not, convert it with to_datetime:
print (df.dtypes)
Date object
val1 float64
val2 float64
dtype: object
df.Date = pd.to_datetime(df.Date)
print (df.dtypes)
Date datetime64[ns]
val1 float64
val2 float64
dtype: object
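Putting the pieces of this answer together, here is a minimal end-to-end sketch on a shortened version of the question's data ('1h' is the current spelling of the frequency alias; older pandas also accepts '1H'):

```python
import math

import pandas as pd

def log_add(array_like):
    # Power-sum of dB values: convert to linear, sum, convert back.
    return 10 * math.log10(sum(10 ** (i / 10) for i in array_like))

df = pd.DataFrame({
    'Date': ['2016-01-30 07:00:00', '2016-01-30 07:15:00',
             '2016-01-30 07:30:00', '2016-01-30 07:45:00',
             '2016-01-30 08:00:00', '2016-01-30 08:15:00'],
    'val1': [49.0, 49.0, 52.0, 60.0, 63.0, 61.0],
    'val2': [45.0, 44.0, 47.0, 46.0, 61.0, 60.0],
})

df['Date'] = pd.to_datetime(df['Date'])  # make sure Date is datetime64
df = df.set_index('Date')                # resample needs a DatetimeIndex (or on='Date')
out = df.resample('1h').apply(log_add)   # one log-added row per hour
```

With a datetime index in place, apply runs log_add on each column of each hourly bin instead of returning an empty frame.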
Related
I have a DataFrame of the following form:
You see that it has a MultiIndex. For each muni index value I want to resample the popDate level with .resample('A').mean(), so that Python fills in the missing years. NaN values should then be replaced by linear interpolation. How do I do that?
Update: Some mock input DataFrame:
interData = pd.DataFrame({'muni': ['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'],
                          'popDate': ['2015', '2021', '2022', '2015', '2017', '2022'],
                          'population': [5, 11, 22, 15, 17, 22]})
interData['popDate'] = pd.to_datetime(interData['popDate'])
interData = interData.set_index(['muni', 'popDate'])
It looks like you want a groupby.resample:
interData.groupby(level='muni').resample('A', level='popDate').mean()
Output:
population
muni popDate
Q1 2015-12-31 5.0
2016-12-31 NaN
2017-12-31 NaN
2018-12-31 NaN
2019-12-31 NaN
2020-12-31 NaN
2021-12-31 11.0
2022-12-31 22.0
Q2 2015-12-31 15.0
2016-12-31 NaN
2017-12-31 17.0
2018-12-31 NaN
2019-12-31 NaN
2020-12-31 NaN
2021-12-31 NaN
2022-12-31 22.0
If you also need interpolation, combine with interpolate:
out = (interData.groupby(level='muni')
                .apply(lambda g: g.resample('A', level='popDate').mean()
                                  .interpolate(method='time')))
Output:
population
muni popDate
Q1 2015-12-31 5.000000
2016-12-31 6.001825
2017-12-31 7.000912
2018-12-31 8.000000
2019-12-31 8.999088
2020-12-31 10.000912
2021-12-31 11.000000
2022-12-31 22.000000
Q2 2015-12-31 15.000000 # 366 days between 2015-12-31 and 2016-12-31
2016-12-31 16.001368 # 365 days between 2016-12-31 and 2017-12-31
2017-12-31 17.000000
2018-12-31 17.999452
2019-12-31 18.998905
2020-12-31 20.001095
2021-12-31 21.000548
2022-12-31 22.000000
I found this behavior of resample to be confusing after working on a related question. Here are some time series data at 5 minute intervals but with missing rows (code to construct at end):
user value total
2020-01-01 09:00:00 fred 1 1
2020-01-01 09:05:00 fred 13 1
2020-01-01 09:15:00 fred 27 3
2020-01-01 09:30:00 fred 40 12
2020-01-01 09:35:00 fred 15 12
2020-01-01 10:00:00 fred 19 16
I want to fill in the missing times, using a different method for each column: for user and total I want to do a forward fill, while for value I want to fill in with zeroes.
One approach I found was to resample, and then fill in the missing data after the fact:
resampled = df.resample('5T').asfreq()
resampled['user'] = resampled['user'].ffill()
resampled['total'] = resampled['total'].ffill()
resampled['value'] = resampled['value'].fillna(0)
Which gives correct expected output:
user value total
2020-01-01 09:00:00 fred 1.0 1.0
2020-01-01 09:05:00 fred 13.0 1.0
2020-01-01 09:10:00 fred 0.0 1.0
2020-01-01 09:15:00 fred 27.0 3.0
2020-01-01 09:20:00 fred 0.0 3.0
2020-01-01 09:25:00 fred 0.0 3.0
2020-01-01 09:30:00 fred 40.0 12.0
2020-01-01 09:35:00 fred 15.0 12.0
2020-01-01 09:40:00 fred 0.0 12.0
2020-01-01 09:45:00 fred 0.0 12.0
2020-01-01 09:50:00 fred 0.0 12.0
2020-01-01 09:55:00 fred 0.0 12.0
2020-01-01 10:00:00 fred 19.0 16.0
I thought one would be able to use agg to specify what to do by column. I try to do the following:
resampled = df.resample('5T').agg({'user': 'ffill',
                                   'value': 'sum',
                                   'total': 'ffill'})
I find this clearer and simpler, but it doesn't give the expected output. The sum works, but the forward fill does not:
user value total
2020-01-01 09:00:00 fred 1 1.0
2020-01-01 09:05:00 fred 13 1.0
2020-01-01 09:10:00 NaN 0 NaN
2020-01-01 09:15:00 fred 27 3.0
2020-01-01 09:20:00 NaN 0 NaN
2020-01-01 09:25:00 NaN 0 NaN
2020-01-01 09:30:00 fred 40 12.0
2020-01-01 09:35:00 fred 15 12.0
2020-01-01 09:40:00 NaN 0 NaN
2020-01-01 09:45:00 NaN 0 NaN
2020-01-01 09:50:00 NaN 0 NaN
2020-01-01 09:55:00 NaN 0 NaN
2020-01-01 10:00:00 fred 19 16.0
Can someone explain this output, and whether there is a way to achieve the expected output using agg? It seems odd that the forward fill doesn't work here: if I were to just do resampled = df.resample('5T').ffill(), that would work for every column (undesired here, as it would also fill the value column). The closest I have come is to resample each column individually and apply the function I want:
resampled = pd.DataFrame()
d = {'user': 'ffill',
     'value': 'sum',
     'total': 'ffill'}
for k, v in d.items():
    resampled[k] = df[k].resample('5T').apply(v)
This works, but feels silly given that it adds extra iteration and uses the very dictionary I am trying to pass to agg! I have looked at a few posts on agg and apply but can't explain what is happening here:
Losing String column when using resample and aggregation with pandas
resample multiple columns with pandas
pandas groupby with agg not working on multiple columns
Pandas named aggregation not working with resample agg
I have also tried using groupby with a pd.Grouper and using the pd.NamedAgg class, with no luck.
Example data:
import pandas as pd
dates = ['01-01-2020 9:00', '01-01-2020 9:05', '01-01-2020 9:15',
         '01-01-2020 9:30', '01-01-2020 9:35', '01-01-2020 10:00']
dates = pd.to_datetime(dates)
df = pd.DataFrame({'user': ['fred'] * len(dates),
                   'value': [1, 13, 27, 40, 15, 19],
                   'total': [1, 1, 3, 12, 12, 16]},
                  index=dates)
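One workaround that keeps the spirit of the dictionary approach without the per-column loop (a sketch, not the single agg call the question asks for): let agg handle the summed column, then forward-fill the remaining columns on the same resampled grid ('5min' is the current spelling of the '5T' alias):

```python
import pandas as pd

dates = pd.to_datetime(['01-01-2020 9:00', '01-01-2020 9:05', '01-01-2020 9:15',
                        '01-01-2020 9:30', '01-01-2020 9:35', '01-01-2020 10:00'])
df = pd.DataFrame({'user': ['fred'] * len(dates),
                   'value': [1, 13, 27, 40, 15, 19],
                   'total': [1, 1, 3, 12, 12, 16]},
                  index=dates)

# Sum 'value' per 5-minute bin; empty bins sum to 0, which is what we want.
resampled = df.resample('5min').agg({'value': 'sum'})
# Forward-fill the other columns on the same 5-minute grid, then attach them.
resampled[['user', 'total']] = df[['user', 'total']].resample('5min').ffill()
```

This reproduces the expected output in two vectorized steps instead of a Python-level loop over columns.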
I have a dataframe that looks roughly like:
01/01/19 02/01/19 03/01/19 04/01/19
hour
1.0 27.08 47.73 54.24 10.0
2.0 26.06 49.53 46.09 22.0
...
24.0 12.0 34.0 22.0 40.0
I'd like to reduce it to a single column with a proper datetime index by concatenating all the columns. Is there a smart pandas way to do it?
Expected result... something like:
01/01/19 00:00:00 27.08
01/01/19 01:00:00 26.06
...
01/01/19 23:00:00 12.00
02/01/19 00:00:00 47.73
02/01/19 01:00:00 49.53
...
02/01/19 23:00:00 34.00
...
You can stack and then fix the index using pd.to_datetime and pd.to_timedelta:
u = df.stack()
u.index = (pd.to_datetime(u.index.get_level_values(1), dayfirst=True)
+ pd.to_timedelta(u.index.get_level_values(0) - 1, unit='h'))
u.sort_index()
2019-01-01 00:00:00 27.08
2019-01-01 01:00:00 26.06
2019-01-01 23:00:00 12.00
2019-01-02 00:00:00 47.73
2019-01-02 01:00:00 49.53
2019-01-02 23:00:00 34.00
2019-01-03 00:00:00 54.24
2019-01-03 01:00:00 46.09
2019-01-03 23:00:00 22.00
2019-01-04 00:00:00 10.00
2019-01-04 01:00:00 22.00
2019-01-04 23:00:00 40.00
dtype: float64
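For reference, the same approach in miniature, on a mock frame with just two days and the first two hours (the column names and values are taken from the question's example):

```python
import pandas as pd

df = pd.DataFrame({'01/01/19': [27.08, 26.06],
                   '02/01/19': [47.73, 49.53]},
                  index=pd.Index([1.0, 2.0], name='hour'))

u = df.stack()  # MultiIndex of (hour, date column)
# Dates are day-first strings; hour 1.0 maps to midnight, hence the -1 offset.
u.index = (pd.to_datetime(u.index.get_level_values(1), dayfirst=True)
           + pd.to_timedelta(u.index.get_level_values(0) - 1, unit='h'))
u = u.sort_index()
```

Each (hour, date) pair collapses into a single timestamp, so the wide table becomes one chronologically sorted Series.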
In Python 3.6.3, I have the following dataframe df1:
dt Val
2017-04-10 08:00:00 8.0
2017-04-10 09:00:00 2.0
2017-04-10 10:00:00 7.0
2017-04-11 08:00:00 3.0
2017-04-11 09:00:00 0.0
2017-04-11 10:00:00 5.0
2017-11-26 08:00:00 8.0
2017-11-26 09:00:00 1.0
2017-11-26 10:00:00 2.0
I am trying to compute the hourly average of these values, so as to have:
Hour Val
08:00:00 6.33
09:00:00 1.00
10:00:00 4.66
My attempt:
df2 = df1.resample('H')['Val'].mean()
This returns the same dataset as df1. What am I doing wrong?
Inspired by the comments above, I tested that the following works for me:
df.groupby(df.index.hour).Val.mean()
Or you can cast the index's hour values to a timedelta dtype:
df.Val.groupby(df.index.hour.astype('timedelta64[h]')).mean()
dt
08:00:00 6.333333
09:00:00 1.000000
10:00:00 4.666667
Name: Val, dtype: float64
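A self-contained sketch of the groupby-by-hour approach, rebuilt from the question's data: the key point is that df1.index.hour groups by hour of day across all dates, whereas resample('H') keeps each calendar hour as its own bin.

```python
import pandas as pd

df1 = pd.DataFrame(
    {'Val': [8.0, 2.0, 7.0, 3.0, 0.0, 5.0, 8.0, 1.0, 2.0]},
    index=pd.to_datetime(['2017-04-10 08:00', '2017-04-10 09:00', '2017-04-10 10:00',
                          '2017-04-11 08:00', '2017-04-11 09:00', '2017-04-11 10:00',
                          '2017-11-26 08:00', '2017-11-26 09:00', '2017-11-26 10:00']))

# Group by hour-of-day (integers 0-23), pooling the same hour across all dates.
hourly = df1.groupby(df1.index.hour)['Val'].mean()
```

For this data the means are (8+3+8)/3, (2+0+1)/3 and (7+5+2)/3 for hours 8, 9 and 10.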
A dataframe contains only a few timestamps per day, and I need to select the latest one for each date (not the values, the timestamp itself). The df looks like this:
A B C
2016-12-05 12:00:00+00:00 126.0 15.0 38.54
2016-12-05 16:00:00+00:00 131.0 20.0 42.33
2016-12-14 05:00:00+00:00 129.0 18.0 43.24
2016-12-15 03:00:00+00:00 117.0 22.0 33.70
2016-12-15 04:00:00+00:00 140.0 23.0 34.81
2016-12-16 03:00:00+00:00 120.0 21.0 32.24
2016-12-16 04:00:00+00:00 142.0 22.0 35.20
I managed to achieve what I needed by defining the following function:
def find_last_h(df, column):
    newindex = []
    df2 = df.resample('d').last().dropna()
    for x in df2[column].values:
        newindex.append(df[df[column] == x].index.values[0])
    return pd.DatetimeIndex(newindex)
with which I specify which column's values to use as a filter to get the desired timestamps. The issue is that, with non-unique values, this might not work as desired.
Another way that is used is:
grouped = df.groupby([df.index.day, df.index.hour]).last()
grouped.groupby(level=0).last()
and then reconstruct the timestamps, but that is even more verbose. What is the smart way?
Use boolean indexing with a mask created by duplicated, and floor to truncate the times to dates:
idx = df.index.floor('D')
df = df[~idx.duplicated(keep='last') | ~idx.duplicated(keep=False)]
print (df)
A B C
2016-12-05 16:00:00 131.0 20.0 42.33
2016-12-14 05:00:00 129.0 18.0 43.24
2016-12-15 04:00:00 140.0 23.0 34.81
2016-12-16 04:00:00 142.0 22.0 35.20
Another solution with reset_index + set_index:
df = df.reset_index().groupby([df.index.date]).last().set_index('index')
print (df)
A B C
index
2016-12-05 16:00:00 131.0 20.0 42.33
2016-12-14 05:00:00 129.0 18.0 43.24
2016-12-15 04:00:00 140.0 23.0 34.81
2016-12-16 04:00:00 142.0 22.0 35.20
resample and groupby by date alone lose the times:
print (df.resample('1D').last().dropna())
A B C
2016-12-05 131.0 20.0 42.33
2016-12-14 129.0 18.0 43.24
2016-12-15 140.0 23.0 34.81
2016-12-16 142.0 22.0 35.20
print (df.groupby([df.index.date]).last())
A B C
2016-12-05 131.0 20.0 42.33
2016-12-14 129.0 18.0 43.24
2016-12-15 140.0 23.0 34.81
2016-12-16 142.0 22.0 35.20
How about
df.resample('24H', kind='period').last().dropna() ?
You can groupby the date and just take the max of each datetime to get the last datetime on each date.
This may look like:
df.groupby(df["datetime"].dt.date)["datetime"].max()
or something like
df.groupby(pd.Grouper(freq='D'))["datetime"].max()
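A runnable sketch of the groupby-max idea, adapted to the question's setup where the timestamps live in the index rather than in a datetime column (the index is pulled out with to_series so the timestamps themselves are what gets aggregated):

```python
import pandas as pd

idx = pd.to_datetime(['2016-12-05 12:00', '2016-12-05 16:00', '2016-12-14 05:00',
                      '2016-12-15 03:00', '2016-12-15 04:00'])
df = pd.DataFrame({'A': [126.0, 131.0, 129.0, 117.0, 140.0]}, index=idx)

# Turn the index into a Series of timestamps, group by calendar date,
# and take the max: the latest timestamp on each date.
last_ts = df.index.to_series().groupby(df.index.date).max()
```

The result is one timestamp per date, which can then be used to index back into df if the rows themselves are needed.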