I have a dataframe df which looks like
time(float)  value (float)
      10.45             10
      10.50             20
      10.55             25
      11.20             30
      11.44             20
      12.30             30
I need help calculating a new column called rolling_average_value, which is the average of that row's value and all values within 1 hour before that row, so that the new dataframe looks like:
time(float)  value (float)  rolling_average_value
      10.45             10                  10
      10.50             20                  15
      10.55             25                  18.33
      11.20             30                  21.25
      11.44             20                  21
      12.30             30                  25
Note: the time column is a float column, not a datetime.
You can temporarily set a datetime index and apply rolling.mean:
# extract hours/minutes from the float times
import numpy as np
import pandas as pd

minutes, hours = np.modf(df['time(float)'])
hours = hours.astype(int)
# round before casting: np.modf(10.45) gives ~0.4499999... due to float precision
minutes = minutes.mul(100).round().astype(int)
dt = pd.to_datetime(hours.astype(str) + minutes.astype(str).str.zfill(2),
                    format='%H%M')

# perform the rolling computation on a temporary datetime index,
# then restore the original index
df['rolling_mean'] = (df.set_axis(dt)
                        .rolling('1h')['value (float)']
                        .mean()
                        .set_axis(df.index)
                     )
Output:
   time(float)  value (float)  rolling_mean
0        10.45             10     10.000000
1        10.50             20     15.000000
2        10.55             25     18.333333
3        11.20             30     21.250000
4        11.44             20     21.000000
5        12.30             30     25.000000
An alternative way to compute dt:
# format with two decimals first so trailing zeros survive (str(10.5) is
# '10.5', which '%H.%M' would read as 10:05), then zero-pad single-digit hours
dt = pd.to_datetime(df['time(float)'].map('{:.2f}'.format).str.zfill(5),
                    format='%H.%M')
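Either way, dt carries plain timestamps on a dummy date, which is irrelevant for the rolling window. For the sample data:
>>> dt.head(2)
0   1900-01-01 10:45:00
1   1900-01-01 10:50:00
Name: time(float), dtype: datetime64[ns]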
Assuming your data frame is sorted by time, you can also solve this with a simple list comprehension. Iterate over the times, collect all indices where the distance from the earlier time values to the current value is less than one (i.e. less than one hour), slice the value column (converted to an array) by those indices, and take the mean of the slice:
import pandas as pd
import numpy as np

df = pd.DataFrame(
    {"time": [10.45, 10.5, 10.55, 11.2, 11.44, 12.3],
     "value": [10, 20, 25, 30, 20, 30]}
)
times = df["time"].values
values = df["value"].values
# for each row i, average every value whose time lies within 1 hour before times[i]
df["rolling_mean"] = [round(np.mean(values[np.where(times[i] - times[:i+1] < 1)[0]]), 2)
                      for i in range(len(times))]
If your data frame is large, you can JIT-compile this loop with numba to make it significantly faster:
from numba import njit

@njit
def compute_rolling_mean(times, values):
    return [round(np.mean(values[np.where(times[i] - times[:i+1] < 1)[0]]), 2)
            for i in range(len(times))]

df["rolling_mean"] = compute_rolling_mean(df["time"].values, df["value"].values)
Output:
    time  value  rolling_mean
0  10.45     10         10.00
1  10.50     20         15.00
2  10.55     25         18.33
3  11.20     30         21.25
4  11.44     20         21.00
5  12.30     30         25.00
I'm using Python with pandas to export data from a database to CSV. The data looks like this when exported (around 100 logs/day, purely for visualisation purposes):
time                 Buf1  Buf2
12/12/2022 19:15:56    12     3
12/12/2022 18:00:30     5    18
11/12/2022 15:15:08    12     3
11/12/2022 15:15:08    10     9
Right now I only write the raw data to a CSV, but I need to generate a min, max, and average value for each day. What's the best way to do that? I've tried min() and max() functions, but the problem is that I have multiple days in these CSV files. I've also tried manipulating the data in Python itself, but I'm worried I'll miss some rows and the data will no longer be correct.
I would like to end up with something like this:
time        buf1_max  buf_min
12/12/2022        12        3
12/12/2022        12       10
Here you go, step by step.
In [27]: df['time'] = pd.to_datetime(df['time']).dt.date
In [28]: df
Out[28]:
         time  Buf1  Buf2
0  2022-12-12    12     3
1  2022-12-12     5    18
2  2022-11-12    12     3
3  2022-11-12    10     9
In [29]: df = df.set_index("time")
In [30]: df
Out[30]:
            Buf1  Buf2
time
2022-12-12    12     3
2022-12-12     5    18
2022-11-12    12     3
2022-11-12    10     9
In [31]: df.groupby(df.index).agg(['min', 'max', 'mean'])
Out[31]:
           Buf1           Buf2
            min max  mean  min max  mean
time
2022-11-12   10  12  11.0    3   9   6.0
2022-12-12    5  12   8.5    3  18  10.5
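If you want flat column names like the buf1_max / buf_min in your desired output, one option is to join the two levels of the aggregated columns. A small sketch (the exact name pattern is an assumption based on your example):
agg = df.groupby(df.index).agg(['min', 'max', 'mean'])
# join ('Buf1', 'max') into 'buf1_max', and so on
agg.columns = [f'{col.lower()}_{stat}' for col, stat in agg.columns]
print(agg)
#             buf1_min  buf1_max  buf1_mean  buf2_min  buf2_max  buf2_mean
# time
# 2022-11-12        10        12       11.0         3         9        6.0
# 2022-12-12         5        12        8.5         3        18       10.5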
Another approach is to use pivot_table to simplify the grouping (remember to convert the 'time' column to datetime64 format first, as suggested above):
import pandas as pd

df.pivot_table(
    index='time',
    values=['Buf1', 'Buf2'],
    aggfunc={'Buf1': ['min', 'max', 'mean'], 'Buf2': ['min', 'max', 'mean']}
)
You can add any aggfunc as you wish.
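If you'd rather avoid MultiIndex columns entirely, named aggregation (available since pandas 0.25) produces flat, explicit names in one go; a sketch (the output column names are assumptions):
df.groupby('time').agg(
    buf1_min=('Buf1', 'min'),
    buf1_max=('Buf1', 'max'),
    buf2_min=('Buf2', 'min'),
    buf2_max=('Buf2', 'max'),
)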
I have a dataframe out:
         dates       min       max        wh
0   2005-09-06  07:41:18  21:59:57  14:18:39
1   2005-09-12  14:49:22  14:49:22  00:00:00
2   2005-09-19  11:08:56  11:24:05  00:15:09
3   2005-09-21  21:19:21  21:20:15  00:00:54
4   2005-09-22  19:41:52  19:41:52  00:00:00
5   2005-10-13  11:22:07  21:05:41  09:43:34
6   2005-11-22  11:53:12  21:21:22  09:28:10
7   2005-11-23  00:07:01  14:08:50  14:01:49
8   2005-11-30  13:42:48  23:59:19  10:16:31
9   2005-12-01  00:05:16  10:24:12  10:18:56
10  2005-12-21  17:38:43  19:26:03  01:47:20
11  2005-12-22  09:20:07  11:25:40  02:05:33
12  2006-01-23  07:46:20  08:01:52  00:15:32
13  2006-04-27  16:27:54  19:29:52  03:01:58
14  2006-05-11  12:48:34  23:10:44  10:22:10
15  2006-05-15  10:14:59  22:28:12  12:13:13
16  2006-05-16  01:14:07  23:55:51  22:41:44
17  2006-05-17  01:12:45  23:57:56  22:45:11
18  2006-05-18  02:42:08  21:48:49  19:06:41
and I want the average work hours per day (represented by the column wh), per month.
out['dates'] = pd.to_datetime(out['dates'])
out['month'] = pd.PeriodIndex(out.dates, freq='M')
out2 = out.groupby('month')['wh'].mean().reset_index(name='wh2')
This is what I have so far, but the values in wh are not numeric, so I can't compute the mean. How can I convert the whole wh column so that the mean works?
My wh column was created as follows:
df = pd.read_csv("Testordner2/"+i, parse_dates=True)
df['new_time'] = pd.to_datetime(df['new_time'])
df['dates'] = df['new_time'].dt.date
df['time'] = df['new_time'].dt.time
out = df.groupby(df['dates']).agg({'time': ['min', 'max']}) \
        .stack(level=0).droplevel(1)
out['min_as_time_format'] = pd.to_datetime(out['min'], format="%H:%M:%S")
out['max_as_time_format'] = pd.to_datetime(out['max'], format="%H:%M:%S")
out['wh'] = out['max_as_time_format'] - out['min_as_time_format']
out['wh'] = out['wh'].astype(str).str[-18:-10]
One possible solution is to convert the timedeltas to their native integer representation (nanoseconds), aggregate the mean, and then convert back to timedeltas:
import numpy as np

out['dates'] = pd.to_datetime(out['dates'])
out['month'] = pd.PeriodIndex(out.dates, freq='M')
# timedeltas stored as int64 nanoseconds can be averaged numerically
out['wh'] = pd.to_timedelta(out['wh']).astype(np.int64)
out2 = pd.to_timedelta(out.groupby('month')['wh'].mean()).reset_index(name='wh2')
print(out2)
     month              wh2
0  2005-09  02:54:56.400000
1  2005-10         09:43:34
2  2005-11         11:15:30
3  2005-12  04:43:56.333333
4  2006-01         00:15:32
5  2006-04         03:01:58
6  2006-05  17:25:47.800000
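Depending on your pandas version, the int64 round-trip may not be needed at all: recent versions can average timedelta columns directly. A sketch, assuming a reasonably current pandas:
out['wh'] = pd.to_timedelta(out['wh'])
out2 = out.groupby('month')['wh'].mean().reset_index(name='wh2')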
I have two dataframes read from Excel files which look like the below. The first dataframe has a multi-index header.
I am trying to find the correlation between each column in the dataframe with the corresponding dataframe based on the currency (i.e KRW, THB, USD, INR). At the moment, I am doing a loop to iterate through each column, matching by index and corresponding header before finding the correlation.
for stock_name in index_data.columns.get_level_values(0):
    stock_prices = index_data.xs(stock_name, level=0, axis=1)
    stock_prices = stock_prices.dropna()
    fx = currency_data[stock_prices.columns.get_level_values(1).values[0]]
    fx = fx[fx.index.isin(stock_prices.index)]
    merged_df = pd.merge(stock_prices, fx, left_index=True, right_index=True)
    merged_df[0].corr(merged_df[1])
Is there a more pandas-ish way of doing this?
So you wish to find the correlation between the stock price and its related currency. (Or stock price correlation to all currencies?)
import numpy as np
import pandas as pd

# dummy data
date_range = pd.date_range('2019-02-01', '2019-03-01', freq='D')
stock_prices = pd.DataFrame(
    np.random.randint(1, 20, (date_range.shape[0], 4)),
    index=date_range,
    columns=[['BYZ6DH', 'BLZGSL', 'MBT', 'BAP'],
             ['KRW', 'THB', 'USD', 'USD']])
fx = pd.DataFrame(np.random.randint(1, 20, (date_range.shape[0], 3)),
                  index=date_range, columns=['KRW', 'THB', 'USD'])
This is what it looks like; calculating correlations on this data won't mean much, since it is random:
>>> print(stock_prices.head())
           BYZ6DH BLZGSL MBT BAP
              KRW    THB USD USD
2019-02-01     15     10  19  19
2019-02-02      5      9  19   5
2019-02-03     19      7  18  10
2019-02-04      1      6   7  18
2019-02-05     11     17   6   7
>>> print(fx.head())
            KRW  THB  USD
2019-02-01   15   11   10
2019-02-02    6    5    3
2019-02-03   13    1    3
2019-02-04   19    8   14
2019-02-05    6   13    2
Use apply to calculate the correlation between each stock column and the fx column for the same currency.
def f(x, fx):
    # x.name is the column's (stock, currency) tuple; pick the matching fx column
    correlation = x.corr(fx[x.name[1]])
    return correlation

correlation = stock_prices.apply(f, args=(fx,), axis=0)
>>> print(correlation)
BYZ6DH  KRW   -0.247529
BLZGSL  THB    0.043084
MBT     USD   -0.471750
BAP     USD    0.314969
dtype: float64
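A loop-free alternative sketch on the same dummy data: align the fx columns with the stock columns (selecting duplicate labels is allowed), then let DataFrame.corrwith pair up matching columns:
# one fx column per stock, relabelled with the stock's MultiIndex key
fx_aligned = fx[stock_prices.columns.get_level_values(1)]
fx_aligned.columns = stock_prices.columns
correlation = stock_prices.corrwith(fx_aligned)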
I have the following DataFrame:
actor       Daily Total  actor1  actor2
Day
2019-01-01           25      10      15
2019-01-02           30      15      15
Avg                27.5    12.5    15.0
How do I change the data type of the 'Avg' row to integer? How do I round the values in that row?
In pandas, appending a new row of floats upcasts all columns to float.
One possible solution is to round and convert all columns:
df = df.round().astype(int)
Or append the new Series already rounded and converted to integer:
# note: DataFrame.append was removed in pandas 2.0; there, use
# pd.concat([df, row.to_frame().T]) instead
df = df.append(df.mean().rename('Avg').round().astype(int))
print(df)
            Daily Total  actor1  actor2
actor
2019-01-01           25      10      15
2019-01-02           30      15      15
Avg                  28      12      15
If you want to convert only the columns whose Avg values are whole numbers:
d = dict.fromkeys(df.columns[df.loc['Avg'] == df.loc['Avg'].astype(int)], 'int')
df = df.astype(d)
print(df)

            Daily Total  actor1  actor2
actor
2019-01-01         25.0    10.0      15
2019-01-02         30.0    15.0      15
Avg                27.5    12.5      15
Use loc to access the row by its index label, then apply numpy.round:
import numpy as np
df.loc['Avg'] = df.loc['Avg'].apply(np.round)
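Equivalently, Series.round avoids the apply call:
df.loc['Avg'] = df.loc['Avg'].round()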
My data has ages, and also payments per month.
I'm trying to aggregate by summing the payments, but without summing the ages (averaging them would work).
Is it possible to use different functions for different columns?
You can pass a dictionary to agg with column names as keys and the functions you want as values.
import pandas as pd
import numpy as np
# Create some randomised data
N = 20
date_range = pd.date_range('01/01/2015', periods=N, freq='W')
df = pd.DataFrame({'ages':np.arange(N), 'payments':np.arange(N)*10}, index=date_range)
print(df.head())
#             ages  payments
# 2015-01-04     0         0
# 2015-01-11     1        10
# 2015-01-18     2        20
# 2015-01-25     3        30
# 2015-02-01     4        40
# Apply np.mean to the ages column and np.sum to the payments.
agg_funcs = {'ages':np.mean, 'payments':np.sum}
# Groupby each individual month and then apply the funcs in agg_funcs
grouped = df.groupby(df.index.to_period('M')).agg(agg_funcs)
print(grouped)
#          ages  payments
# 2015-01   1.5        60
# 2015-02   5.5       220
# 2015-03  10.0       500
# 2015-04  14.5       580
# 2015-05  18.0       540
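As a side note, newer pandas versions warn about passing NumPy callables to agg; string function names or named aggregation do the same job. A sketch on the same frame:
grouped = df.groupby(df.index.to_period('M')).agg(
    ages=('ages', 'mean'),
    payments=('payments', 'sum'),
)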