How do I run multiple functions on my aggregated pandas dataframe - python

I have the following data for wind speed and wind direction taken over the course of a month in Salt Lake City. I want to group by the hour data were taken. For the data taken within that hour, I want to accomplish two things: (1) calculate mean wind speed (2) apply a function I have defined ("yamatrino") to all the wind_direction measurements taken within each hour.
time Station_ID wind_speed wind_direction
0 2019-08-01 00:00:00 UTC WBB 3.48 96.1
1 2019-08-01 00:00:00 UTC UT215 6.54 141.4
2 2019-08-01 00:00:00 UTC MTMET 3.39 67.75
3 2019-08-01 00:00:00 UTC NAA 5.99 154.9
4 2019-08-01 00:00:00 UTC QHW 1.52 107
Below is the code I have written to (1) convert time data into a datetime format and (2) to create two columns with the mean wind speeds and yamatrino values for each hour of data.
df['time'] = pd.to_datetime(df['time'], format ='%Y-%m-%d %H:%M:%S UTC')
df.groupby(df['time'].dt.hour)['wind_direction', 'wind_speed'].agg([('yamatrino_value', lambda wind_direction: yamatrino(wind_direction)), ('hourly_velocity_mean', np.mean('wind_speed'))])
The error reads "TYPE ERROR: cannot perform reduce with flexible type"
I am confused how to aggregate with more than one column of data.

The error comes from np.mean('wind_speed'), which calls np.mean on the string 'wind_speed' rather than passing the function as an aggregator. Consider using a dictionary inside the DataFrame.groupby.agg call to run separate aggregation functions on separate columns. And if your method expects one parameter, the lambda is not needed.
df.groupby(df['time'].dt.hour).agg({'wind_direction': yamatrino,
                                    'wind_speed': np.mean})
And as of v0.25.0+, you can use named aggregation, which may be what you intended with yamatrino_value and hourly_velocity_mean. Each new column name is assigned a tuple of ('column', aggfunc).
df.groupby(df['time'].dt.hour).agg(yamatrino_value=('wind_direction', yamatrino),
                                   hourly_velocity_mean=('wind_speed', np.mean))
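One thing to keep in mind: grouping on df['time'].dt.hour pools every reading taken at the same hour of day across the whole month (all 14:00 readings together, and so on). If the intent is instead one row per calendar hour of the month, a minimal sketch along these lines would work - it assumes df is the dataframe shown above and uses a hypothetical stand-in for yamatrino, which just needs to accept a Series of wind directions and return a single value:
import pandas as pd

# hypothetical stand-in for the real yamatrino function
def yamatrino(wind_direction):
    return wind_direction.mean()

df['time'] = pd.to_datetime(df['time'], format='%Y-%m-%d %H:%M:%S UTC')

# floor each timestamp to its hour so every calendar hour gets its own group,
# then run one aggregation per column with named aggregation
hourly = df.groupby(df['time'].dt.floor('H')).agg(
    yamatrino_value=('wind_direction', yamatrino),
    hourly_velocity_mean=('wind_speed', 'mean'),
)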

Is there a way to find hourly averages in pandas timeframes that do not start from even hours?

I have a pandas dataframe (python) indexed with timestamps roughly every 10 seconds. I want to find hourly averages, but all functions I find start their averaging at even hours (e.g. hour 9 includes data from 08.00:00 to 08:59:50). Let's say I have the dataframe below.
Timestamp value data
2022-01-01 00:00:00 0.0 5.31
2022-01-01 00:00:10 0.0 0.52
2022-01-01 00:00:20 1.0 9.03
2022-01-01 00:00:30 1.0 4.37
2022-01-01 00:00:40 1.0 8.03
...
2022-01-01 13:52:30 1.0 9.75
2022-01-01 13:52:40 1.0 0.62
2022-01-01 13:52:50 1.0 3.58
2022-01-01 13:53:00 1.0 8.23
2022-01-01 13:53:10 1.0 3.07
Freq: 10S, Length: 5000, dtype: float64
So what I want to do:
Only look at periods where value is consistently 1 for a full hour
Find an hourly average of the data in each of those hours (which could, e.g., run from 01:30:00-02:29:50 or from 11:16:30-12:16:20).
I hope I made my problem clear enough. How do I do this?
EDIT:
Maybe the question was phrased a bit unclearly.
I added a third column, data, which is what I want to find the mean of. I am only interested in time intervals where value = 1 consistently through one hour; the rest of the data can be excluded.
EDIT #2:
A bit of background to my problem: I have a sensor giving me data every 10 seconds. For data to be "approved" certain requirements are to be fulfilled (value in this example), and I need the hourly averages (and preferably timestamps for when this occurs). So in order to maximize the number of possible hours to include in my analysis, I would like to find full hours even if they don't start at an even timestamp.
If I understand you correctly you want a conditional mean - calculate the mean per hour of the data column conditional on the value column being all 1 for every 10s row in that hour.
Assuming your dataframe is called df, the steps to do this are:
Create a grouping column
This is your 'hour' column that can be created by
df['hour'] = df['Timestamp'].dt.hour  # or df.index.hour if Timestamp is the index
Create condition
Now that we have a column to identify groups, we can check which groups are eligible - only those with value consistently equal to 1. With 10s intervals, grouping by hour and summing value should give exactly 360 for a complete hour, since there are 360 10-second intervals per hour.
Group and compute
We can now group and use the aggregate function to:
sum the value column to evaluate against our condition
compute the mean of the data column to return for the valid hours
# group and aggregate
df_mean = df[['hour', 'value', 'data']].groupby('hour').aggregate({'value': 'sum', 'data': 'mean'})
# apply condition
df_mean = df_mean[df_mean['value'] == 360]
That's it - you are left with a dataframe that contains the mean value of data for only the hours where you have a complete hour of value=1.
If you want to augment this so the hours don't have to start on the hour - say 08:00:10-09:00:10 instead of 08:00:00-09:00:00 - the solution is simple: adjust the grouping column but don't change anything else in the process.
To do this you can shift the timestamps forward or back by a timedelta so that the hour attribute can still be leveraged to keep things simple, as in the sketch below.
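A minimal sketch of that idea, assuming Timestamp is a datetime column (use the index instead if it is the index) and that you want hours starting 10 seconds past the hour - only the offset changes, the group/sum/mean steps stay exactly the same:
import pandas as pd

# shift by the desired offset, then extract the hour exactly as before
offset = pd.Timedelta(seconds=10)
df['hour'] = (df['Timestamp'] - offset).dt.hour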
Infer grouping from data
One final idea - if you want to infer, on a rolling basis, which hours you have complete data for, you can do this with a rolling sum - this is even easier. You:
compute the rolling sum of value and mean of data
only select where value is equal to 360
df_roll = df.rolling(360).aggregate({'value': 'sum', 'data': 'mean'})
df_roll = df_roll[df_roll['value'] == 360]
Yes, there is. You need resample with an offset.
Make some test data
Please make sure to provide meaningful test data next time.
import pandas as pd
import numpy as np
# One day in 10 second intervals
index = pd.date_range(start='1/1/2018', end='1/2/2018', freq='10S')
df = pd.DataFrame({"data": np.random.random(len(index))}, index=index)
# This will set the first part of the data to 1, the rest to 0
df["value"] = (df.index < "2018-01-01 10:00:10").astype(int)
This is what we got:
>>> df
data value
2018-01-01 00:00:00 0.377082 1
2018-01-01 00:00:10 0.574471 1
2018-01-01 00:00:20 0.284629 1
2018-01-01 00:00:30 0.678923 1
2018-01-01 00:00:40 0.094724 1
... ... ...
2018-01-01 23:59:20 0.839973 0
2018-01-01 23:59:30 0.890321 0
2018-01-01 23:59:40 0.426595 0
2018-01-01 23:59:50 0.089174 0
2018-01-02 00:00:00 0.351624 0
Get the mean per hour with an offset
Here is a small function that checks if all value rows in the slice are equal to 1 and returns the mean if so, otherwise it (implicitly) returns None.
def get_conditioned_average(frame):
    if frame.value.eq(1).all():
        return frame.data.mean()
Now just apply this to hourly slices, starting, e.g., at 10 seconds after the full hour.
df2 = df.resample('H', offset='10S').apply(get_conditioned_average)
This is the final result:
>>> df2
2017-12-31 23:00:10 0.377082
2018-01-01 00:00:10 0.522144
2018-01-01 01:00:10 0.506536
2018-01-01 02:00:10 0.505334
2018-01-01 03:00:10 0.504431
...
2018-01-01 19:00:10 NaN
2018-01-01 20:00:10 NaN
2018-01-01 21:00:10 NaN
2018-01-01 22:00:10 NaN
2018-01-01 23:00:10 NaN
Freq: H, dtype: float64
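If you only want the complete hours in the output, you can simply drop the NaN rows afterwards (a one-line follow-up under the same assumptions):
df2 = df2.dropna()  # keep only the hours where every value row was 1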

How to obtain difference of a date column in groupby

Currently my data looks like :
user_ID order_number order_start_date order_value week_day
237 135950 1594878.0 2018-01-01 534.0 Monday
235 32911 1594942.0 2018-01-01 89.0 Monday
232 208474 1594891.0 2018-01-01 85.0 Monday
231 9048 1594700.0 2018-01-01 224.0 Monday
228 134896 1594633.0 2018-01-01 449.0 Monday
What I want to achieve is to group the records by user_ID, take the min and max of order_start_date for each user, and find the difference between them in days. Where I am struggling:
Groupby does not inherently support a min-max difference
It does not seem possible to perform numerical operations such as mean() on a datetime series that exists as a column in a dataframe, though it is possible for an individual series.
Any help?
I feel like your description was practically the pseudocode!
output = df.groupby('user_ID')['order_start_date'].apply(lambda g: g.max()-g.min())
You can then get the difference in days as numbers (rather than timedeltas):
output = [i / pd.Timedelta(days=1) for i in output]
The output on your example data is all 0 because there is only one entry per user - is that what you expect?
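A small side note (same assumptions as above): because the groupby result is a Series of timedeltas, you can also get whole days directly without the list comprehension; .dt.days truncates to whole days, while dividing by pd.Timedelta(days=1) keeps fractional days:
output = df.groupby('user_ID')['order_start_date'].apply(lambda g: g.max() - g.min())
output_days = output.dt.days  # integer number of whole days per user_ID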
As for taking the mean, you just need to represent the dates as seconds since some epoch and then take the average. I had tried to convert everything to timedeltas since an old time and then average, but this post does it better and works well with groupby. Here's a test scenario where it's all data for one user_ID and the dates go from Jan 1st to Jan 5th, 2020:
import numpy as np
import pandas as pd

df.loc[:, 'user_ID'] = 1111
df['order_start_date'] = pd.date_range('01-01-2020', '01-05-2020', periods=5)
df['order_start_date'] = np.array(df['order_start_date'], dtype='datetime64[s]').view('i8')
output = df.groupby('user_ID')['order_start_date'].mean().astype('datetime64[s]')
Results:
user_ID
1111 2020-01-03

How to select a subset of a pandas DatetimeIndex whose dates are in a list?

Let's say I have an idx = pd.DatetimeIndex with one-minute frequency. I also have a list of bad dates (each of type pd.Timestamp, without the time information) that I want to remove from the original idx. How do I do that in pandas?
Use normalize to remove the time part from your index so you can do a simple ~ + isin selection, i.e. find the dates not in that bad list. You can further ensure your list of dates doesn't have a time part with the same trick, [x.normalize() for x in bad_dates], if you need to be extra safe.
Sample Data
import pandas as pd
df = pd.DataFrame(range(9), index=pd.date_range('2010-01-01', freq='11H', periods=9))
bad_dates = [pd.Timestamp('2010-01-02'), pd.Timestamp('2010-01-03')]
Code
df[~df.index.normalize().isin(bad_dates)]
# 0
#2010-01-01 00:00:00 0
#2010-01-01 11:00:00 1
#2010-01-01 22:00:00 2
#2010-01-04 05:00:00 7
#2010-01-04 16:00:00 8
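The question asks about a bare DatetimeIndex rather than a DataFrame; the same idea applies directly to the index. A minimal sketch, assuming idx is the minute-frequency index and bad_dates is the list of normalized Timestamps:
import pandas as pd

idx = pd.date_range('2010-01-01', '2010-01-05', freq='1min')
bad_dates = [pd.Timestamp('2010-01-02'), pd.Timestamp('2010-01-03')]

# keep only the minutes whose date is not in the bad list
good_idx = idx[~idx.normalize().isin(bad_dates)]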

How to save or overwrite column values in a dataframe following for loop calculation

I have 3 columns in my dataframe: date, time and snowfall. The snowfall data needs transforming from kg m-3 s-1 to mm/day. To do this, I divide by the density of snow and multiply by the number of seconds in a day: (n/70)*86400.
I would like to do this by overwriting the snowfall column in the dataframe with the calculation, for plotting.
I have a loop function which transforms the values but will not append the results to a list (which is the basic version of what I was looking for); ideally I would like it to simply overwrite the column in the dataframe for ease of plotting.
def snowfallconverter(in1):
    snowfall_fix = []
    for ii in in1:
        snowfall_mm = (ii / 70) * 86400
        snowfall_fix.append(snowfall_mm)
The dataset looks like:
date time snowfall
01/11/2017 12:00:00 -4.43e-06
02/11/2017 12:00:00 -9.04e-08
Thank you in advance.
A loop is not necessary here; just divide and multiply the column by scalars:
df['snowfall_mm'] = df['snowfall'] / 70 * 86400
print (df)
         date      time      snowfall  snowfall_mm
0  01/11/2017  12:00:00 -4.430000e-06    -0.005468
1  02/11/2017  12:00:00 -9.040000e-08    -0.000112
Or overwrite same column:
df['snowfall'] = df['snowfall'] / 70 * 86400
print (df)
         date      time  snowfall
0  01/11/2017  12:00:00 -0.005468
1  02/11/2017  12:00:00 -0.000112

Python pandas calculating average price within 15 minute time frame

I have a dataframe that contains datetime and price.
Here is a sample chosen at random
In [2]: df
Out[2]:
price datetime
239035 5.05 2016-04-14 14:13:27
1771224 5.67 2016-08-30 14:19:47
2859140 4.00 2016-12-05 20:57:01
1311384 7.71 2016-07-08 18:16:22
141709 4.19 2016-04-07 13:30:00
2802527 3.94 2016-11-30 15:36:11
1411955 7.27 2016-07-20 13:55:20
2215987 4.87 2016-10-07 19:56:13
The datetime is accurate to the second.
I want to calculate the average price every 15 minutes starting at 9:00am and ending at 4:30pm, and store the new data into a new dataframe.
I could do it the old-fashioned way: make a list of all the 15 minute time intervals within 9am-4:30pm for each date, iterate through each row of the CSV file, check its time and drop it into the appropriate bucket, then find the average value for each bucket in each day.
But I was wondering if there is a nicer way to do this in pandas. If not I'll just brute force my way through it...
You can use DataFrame.resample:
df2 = df.resample(rule='15Min', on='datetime').mean()
Because resample was given on='datetime', the result already has a DatetimeIndex made of the 15-minute bins, so you can then filter out the times you don't want with between_time:
df2 = df2.between_time('9:00', '16:30')
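As a variation (a sketch under the same column-name assumptions): you can also restrict to the trading window first and then resample; bins with no trades (including the overnight gaps) come back as NaN and can be dropped:
# filter to 9:00-16:30 first, then compute 15-minute mean prices and drop empty bins
df3 = (df.set_index('datetime')
         .between_time('9:00', '16:30')
         .resample('15Min')['price']
         .mean()
         .dropna())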
