Pandas: using multiple functions in a group by

My data has ages, and also payments per month.
I'm trying to aggregate summing the payments, but without summing the ages (averaging would work).
Is it possible to use different functions for different columns?

You can pass a dictionary to agg with column names as keys and the functions you want as values.
import pandas as pd
import numpy as np
# Create some randomised data
N = 20
date_range = pd.date_range('01/01/2015', periods=N, freq='W')
df = pd.DataFrame({'ages':np.arange(N), 'payments':np.arange(N)*10}, index=date_range)
print(df.head())
#             ages  payments
# 2015-01-04     0         0
# 2015-01-11     1        10
# 2015-01-18     2        20
# 2015-01-25     3        30
# 2015-02-01     4        40
# Apply a mean to the ages column and a sum to the payments
agg_funcs = {'ages': 'mean', 'payments': 'sum'}
# Groupby each individual month and then apply the funcs in agg_funcs
grouped = df.groupby(df.index.to_period('M')).agg(agg_funcs)
print(grouped)
#          ages  payments
# 2015-01   1.5        60
# 2015-02   5.5       220
# 2015-03  10.0       500
# 2015-04  14.5       580
# 2015-05  18.0       540
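On newer pandas versions (0.25+) there is also named aggregation, which lets you choose the output column names directly; a minimal sketch of the same grouping as above:

```python
import pandas as pd
import numpy as np

N = 20
date_range = pd.date_range('01/01/2015', periods=N, freq='W')
df = pd.DataFrame({'ages': np.arange(N), 'payments': np.arange(N) * 10},
                  index=date_range)

# Named aggregation: output_name=(input_column, function)
grouped = df.groupby(df.index.to_period('M')).agg(
    mean_age=('ages', 'mean'),
    total_payments=('payments', 'sum'),
)
print(grouped)
```

This produces the same numbers as the dict form, but with explicit column names (mean_age, total_payments) instead of reusing the input column names.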

Related

Python pandas Get daily: MIN MAX AVG results of datasets

Using Python with pandas to export data from a database to CSV. The data looks like this when exported (around 100 logs per day, purely for visualisation purposes):
time                 Buf1  Buf2
12/12/2022 19:15:56    12     3
12/12/2022 18:00:30     5    18
11/12/2022 15:15:08    12     3
11/12/2022 15:15:08    10     9
At the moment I only write the raw data to a CSV, but I need to generate a min, max, and average value for each day. What's the best way to do that? I've tried some min()/max() functions, but the problem is that there are multiple days in these CSV files. I've also tried manipulating the data in Python itself, but I'm worried I'll miss some rows and the data will no longer be correct.
I would like to end up with something like this:
time        buf1_max  buf_min
12/12/2022        12        3
12/12/2022        12       10
Here you go, step by step.
In [27]: df['time'] = pd.to_datetime(df['time']).dt.date
In [28]: df
Out[28]:
        time  Buf1  Buf2
0 2022-12-12    12     3
1 2022-12-12     5    18
2 2022-11-12    12     3
3 2022-11-12    10     9
In [29]: df = df.set_index("time")
In [30]: df
Out[30]:
            Buf1  Buf2
time
2022-12-12    12     3
2022-12-12     5    18
2022-11-12    12     3
2022-11-12    10     9
In [31]: df.groupby(df.index).agg(['min', 'max', 'mean'])
Out[31]:
           Buf1            Buf2
            min max  mean   min max  mean
time
2022-11-12   10  12  11.0     3   9   6.0
2022-12-12    5  12   8.5     3  18  10.5
Another approach is to use pivot_table to simplify the grouping (remember to convert the 'time' column to dates first, as suggested above):
import pandas as pd
import numpy as np
df.pivot_table(
    index='time',
    values=['Buf1', 'Buf2'],
    aggfunc={'Buf1': ['min', 'max', 'mean'], 'Buf2': ['min', 'max', 'mean']}
)
You can add any aggfunc as you wish.
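Both approaches leave a two-level MultiIndex on the columns. If you want flat names like buf1_max in the final CSV (as in the desired output above), you can join the two levels before writing; a sketch, rebuilding the sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'time': ['12/12/2022 19:15:56', '12/12/2022 18:00:30',
             '11/12/2022 15:15:08', '11/12/2022 15:15:08'],
    'Buf1': [12, 5, 12, 10],
    'Buf2': [3, 18, 3, 9],
})
df['time'] = pd.to_datetime(df['time']).dt.date

agg = df.groupby('time').agg(['min', 'max', 'mean'])
# Flatten ('Buf1', 'max') -> 'Buf1_max' so the CSV has plain headers
agg.columns = ['_'.join(col) for col in agg.columns]
print(agg)
```

After flattening, agg.to_csv(...) writes one row per day with columns Buf1_min, Buf1_max, Buf1_mean, and so on.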

Rolling average based on another column

I have a dataframe df which looks like
time(float)  value (float)
10.45        10
10.50        20
10.55        25
11.20        30
11.44        20
12.30        30
I need help calculating a new column called rolling_average_value: for each row, the average of its value and all values within the hour before it, so that the new dataframe looks like:
time(float)  value (float)  rolling_average_value
10.45        10             10
10.50        20             15
10.55        25             18.33
11.20        30             21.25
11.44        20             21
12.30        30             25
Note: This time column is a float column
You can temporarily set a datetime index and apply rolling.mean:
import numpy as np
import pandas as pd

# extract hours/minutes from the float
minutes, hours = np.modf(df['time(float)'])
hours = hours.astype(int)
minutes = minutes.mul(100).round().astype(int)  # round first: 0.55*100 is 54.999... in floats
dt = pd.to_datetime(hours.astype(str) + minutes.astype(str).str.zfill(2), format='%H%M')

# perform rolling computation
df['rolling_mean'] = (df.set_axis(dt)
                        .rolling('1h')['value (float)']
                        .mean()
                        .set_axis(df.index)
                     )
output:
   time(float)  value (float)  rolling_mean
0        10.45             10     10.000000
1        10.50             20     15.000000
2        10.55             25     18.333333
3        11.20             30     21.250000
4        11.44             20     21.000000
5        12.30             30     25.000000
Alternative to compute dt:
dt = pd.to_datetime(df['time(float)'].map('{:.2f}'.format), format='%H.%M')
(Formatting to two decimals first matters: zero-filling the string parts instead would turn 10.5 into '10.05', i.e. 10:05 rather than 10:50.)
Assuming your data frame is sorted by time, you can also solve this with a simple list comprehension. Iterate over the times, get all indices where the distance from earlier time values to the current one is less than one (meaning less than one hour), slice the value column (converted to an array) by those indices, and compute the mean of the slice:
import pandas as pd
import numpy as np

df = pd.DataFrame(
    {"time": [10.45, 10.5, 10.55, 11.2, 11.44, 12.3],
     "value": [10, 20, 25, 30, 20, 30]}
)
times = df["time"].values
values = df["value"].values
df["rolling_mean"] = [
    round(np.mean(values[np.where(times[i] - times[:i+1] < 1)[0]]), 2)
    for i in range(len(times))
]
If your data frame is large, you can JIT-compile this loop with numba to make it significantly faster:
from numba import njit

@njit
def compute_rolling_mean(times, values):
    return [round(np.mean(values[np.where(times[i] - times[:i+1] < 1)[0]]), 2)
            for i in range(len(times))]

df["rolling_mean"] = compute_rolling_mean(df["time"].values, df["value"].values)
Output:
    time  value  rolling_mean
0  10.45     10         10.00
1  10.50     20         15.00
2  10.55     25         18.33
3  11.20     30         21.25
4  11.44     20         21.00
5  12.30     30         25.00
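Another way to build the temporary datetime axis, sketched under the same assumption (floats of the form HH.MM): convert the two parts with pd.to_timedelta and anchor them to an arbitrary date, avoiding string formatting entirely.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'time(float)': [10.45, 10.5, 10.55, 11.2, 11.44, 12.3],
                   'value (float)': [10, 20, 25, 30, 20, 30]})

# Split 10.45 into 10 hours and 45 minutes
minutes, hours = np.modf(df['time(float)'])
td = (pd.to_timedelta(hours, unit='h')
      + pd.to_timedelta((minutes * 100).round(), unit='m'))

# Anchor to an arbitrary date so we get a DatetimeIndex for rolling('1h')
df['rolling_mean'] = (df.set_axis(pd.Timestamp('2000-01-01') + td)
                        .rolling('1h')['value (float)']
                        .mean()
                        .set_axis(df.index))
print(df)
```

The anchor date is irrelevant; only the time differences matter for the 1-hour window.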

Trouble with a for loop in a function and combining multiple Series outputs

I'm new to python and am struggling to figure something out. I'm doing some data analysis on an invoice database in pandas with columns of $ amounts, credits, date, and a unique company ID for each package bought.
I want to run every unique company ID through a function that calculates the average spend rate of these credits based on the difference between package purchase dates. I have the basics figured out: my function returns a Series, indexed to the original dataframe, with the average number of credits spent each day between packages. However, it only works with one company ID at a time, and I don't know how to combine these different Series for each company ID so I can correctly add a new column to my dataframe with this average credit spend value for each package. Here's my code so far:
def creditspend(mylist=[]):
    for i in mylist:
        a = df.loc[df['CompanyId'] == i]
        a = a.sort_values(by=['Date'], ascending=False)
        days = a.Date.diff().map(lambda x: abs(x.days))
        spend = a['Credits'] / days
        print(spend)
If I call
creditspend(mylist=[8, 15])
(with multiple inputs) it obviously does not work. What do I need to do to complete this function?
Thanks in advance.
apply() is a very useful method in pandas that applies a function to every row or column of a DataFrame.
So, if your DataFrame is df:
def creditspend(row):
    # some calculation code here
    return spend

df['spend_rate'] = df.apply(creditspend, axis=1)
(axis=1 applies the function to each row; the default, axis=0, applies it to each column.)
Consider a groupby for a CompanyID aggregation. Below demonstrates with random data:
import numpy as np
import pandas as pd

np.random.seed(7182018)
df = pd.DataFrame({'CompanyID': np.random.choice(['julia', 'pandas', 'r', 'sas', 'stata', 'spss'], 50),
                   'Date': np.random.choice(pd.Series(pd.date_range('2018-01-01', freq='D', periods=180)), 50),
                   'Credits': np.random.uniform(0, 1000, 50)
                  }, columns=['Date', 'CompanyID', 'Credits'])

# SORT ONCE OUTSIDE OF PROCESSING
df = df.sort_values(by=['CompanyID', 'Date'], ascending=[True, False]).reset_index(drop=True)

def creditspend(g):
    g['days'] = g.Date.diff().map(lambda x: abs(x.days))
    g['spend'] = g['Credits'] / g['days']
    return g

grp_df = df.groupby('CompanyID').apply(creditspend)
Output
print(grp_df.head(20))
#          Date CompanyID     Credits  days       spend
# 0  2018-06-20     julia  280.522287   NaN         NaN
# 1  2018-06-12     julia  985.009523   8.0  123.126190
# 2  2018-05-17     julia  892.308179  26.0   34.319545
# 3  2018-05-03     julia   97.410360  14.0    6.957883
# 4  2018-03-26     julia  480.206077  38.0   12.637002
# 5  2018-03-07     julia   78.892365  19.0    4.152230
# 6  2018-03-03     julia  878.671506   4.0  219.667877
# 7  2018-02-25     julia  905.172807   6.0  150.862135
# 8  2018-02-19     julia  970.016418   6.0  161.669403
# 9  2018-02-03     julia  669.073067  16.0   41.817067
# 10 2018-01-23     julia  636.926865  11.0   57.902442
# 11 2018-01-11     julia  790.107486  12.0   65.842291
# 12 2018-06-16    pandas  639.180696   NaN         NaN
# 13 2018-05-21    pandas  565.432415  26.0   21.747401
# 14 2018-04-22    pandas  145.232115  29.0    5.008004
# 15 2018-04-13    pandas  379.964557   9.0   42.218284
# 16 2018-04-12    pandas  538.168690   1.0  538.168690
# 17 2018-03-20    pandas  783.572993  23.0   34.068391
# 18 2018-03-14    pandas  618.354489   6.0  103.059081
# 19 2018-02-10    pandas  859.278127  32.0   26.852441
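If the apply call is slow on a large frame, the same days and spend columns can be computed in a vectorized way with a grouped diff; a sketch on the same random data, assuming the frame is sorted as above:

```python
import numpy as np
import pandas as pd

np.random.seed(7182018)
df = pd.DataFrame({'CompanyID': np.random.choice(['julia', 'pandas', 'r', 'sas', 'stata', 'spss'], 50),
                   'Date': np.random.choice(pd.Series(pd.date_range('2018-01-01', freq='D', periods=180)), 50),
                   'Credits': np.random.uniform(0, 1000, 50)
                  }, columns=['Date', 'CompanyID', 'Credits'])
df = df.sort_values(by=['CompanyID', 'Date'], ascending=[True, False]).reset_index(drop=True)

# diff within each company; dates are descending, so take the absolute day count
df['days'] = df.groupby('CompanyID')['Date'].diff().dt.days.abs()
df['spend'] = df['Credits'] / df['days']
print(df.head())
```

groupby(...).diff() restarts at each company boundary, so the first row of every company gets NaN days, exactly as in the apply version.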

Python Pandas Simple Moving Average (deprecated pd.rolling_mean) [duplicate]

I would like to add a moving average calculation to my exchange time series.
Original data from Quandl
Exchange = Quandl.get("BUNDESBANK/BBEX3_D_SEK_USD_CA_AC_000",
                      authtoken="xxxxxxx")
#               Value
# Date
# 1989-01-02  6.10500
# 1989-01-03  6.07500
# 1989-01-04  6.10750
# 1989-01-05  6.15250
# 1989-01-09  6.25500
# 1989-01-10  6.24250
# 1989-01-11  6.26250
# 1989-01-12  6.23250
# 1989-01-13  6.27750
# 1989-01-16  6.31250
# Calculating the moving average
MovingAverage = pd.rolling_mean(Exchange,5)
#               Value
# Date
# 1989-01-02      NaN
# 1989-01-03      NaN
# 1989-01-04      NaN
# 1989-01-05      NaN
# 1989-01-09  6.13900
# 1989-01-10  6.16650
# 1989-01-11  6.20400
# 1989-01-12  6.22900
# 1989-01-13  6.25400
# 1989-01-16  6.26550
I would like to add the calculated Moving Average as a new column to the right after Value using the same index (Date). Preferably I would also like to rename the calculated moving average to MA.
The rolling mean returns a Series; you only have to add it as a new column of your DataFrame (MA) as described below.
For information, the rolling_mean function has been deprecated in newer pandas versions. I have used the new method in my example; see below a quote from the pandas documentation.
Warning: Prior to version 0.18.0, pd.rolling_*, pd.expanding_*, and pd.ewm* were module-level functions and are now deprecated. These are replaced by the Rolling, Expanding and EWM objects and a corresponding method call.
df['MA'] = df['Value'].rolling(window=5).mean()
print(df)
#             Value    MA
# Date
# 1989-01-02   6.11   NaN
# 1989-01-03   6.08   NaN
# 1989-01-04   6.11   NaN
# 1989-01-05   6.15   NaN
# 1989-01-09   6.25  6.14
# 1989-01-10   6.24  6.17
# 1989-01-11   6.26  6.20
# 1989-01-12   6.23  6.23
# 1989-01-13   6.28  6.25
# 1989-01-16   6.31  6.27
A moving average can also be calculated and visualized directly in a line chart by using the following code:
Example using stock price data:
import pandas_datareader.data as web
import matplotlib.pyplot as plt
import datetime

plt.style.use('ggplot')

# Input variables
start = datetime.datetime(2016, 1, 1)
end = datetime.datetime(2018, 3, 29)
stock = 'WFC'

# Extracting data
df = web.DataReader(stock, 'morningstar', start, end)
df = df['Close']
print(df)

plt.plot(df['WFC'], label='Close')
plt.plot(df['WFC'].rolling(9).mean(), label='MA 9 days')
plt.plot(df['WFC'].rolling(21).mean(), label='MA 21 days')
plt.legend(loc='best')
plt.title('Wells Fargo\nClose and Moving Averages')
plt.show()
Tutorial on how to do this: https://youtu.be/XWAPpyF62Vg
In case you are calculating more than one moving average:
for i in range(2, 10):
    df['MA{}'.format(i)] = df['Value'].rolling(window=i).mean()
Then you can compute an aggregate average of all the MA columns:
df[[f for f in list(df) if "MA" in f]].mean(axis=1)
To get a cumulative (expanding) average in pandas we can use cumsum and then divide by the running count.
Here is the working example:
import pandas as pd
import numpy as np

df = pd.DataFrame({'id': range(5),
                   'value': range(100, 600, 100)})

# cumulative sum divided by the running count gives the expanding average
df['cum_sum'] = df['value'].cumsum()
df['count'] = range(1, len(df['value']) + 1)
df['mov_avg'] = df['cum_sum'] / df['count']

# a fixed-window rolling mean for comparison
df['rolling_mean2'] = df['value'].rolling(window=2).mean()
print(df)
print(df)
output
   id  value  cum_sum  count  mov_avg  rolling_mean2
0   0    100      100      1    100.0            NaN
1   1    200      300      2    150.0          150.0
2   2    300      600      3    200.0          250.0
3   3    400     1000      4    250.0          350.0
4   4    500     1500      5    300.0          450.0
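The cum_sum/count construction above is exactly what pandas' expanding window computes; a sketch showing the built-in equivalent:

```python
import pandas as pd

df = pd.DataFrame({'id': range(5),
                   'value': range(100, 600, 100)})

# Expanding mean: average of all rows up to and including each row
df['mov_avg'] = df['value'].expanding().mean()
print(df)
```

expanding() also handles NaN values and min_periods for you, which the manual cumsum/count version does not.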

New column in pandas DataFrame conditional on value of other columns

I have the following pandas DataFrame:
df = pd.DataFrame({'country': ['US', 'FR', 'DE', 'SP'],
                   'energy_per_capita': [10, 8, 9, 7],
                   'pop_2014': [300, 70, 80, 60],
                   'pop_2015': [305, 72, 80, 'NaN']})
I'd like to create a new column:
df['total energy consumption']
which multiplies energy_per_capita and pop.
I'd like it to take pop_2015 when available and pop_2014 when pop_2015 is NaN.
Thanks.
Make sure you read 10 Minutes to pandas. For this case we are using the pandas.DataFrame.fillna method (note the missing value should be np.nan, not the string 'NaN'):
import numpy as np
import pandas as pd

df = pd.DataFrame({'country': ['US', 'FR', 'DE', 'SP'],
                   'energy_per_capita': [10, 8, 9, 7],
                   'pop_2014': [300, 70, 80, 60],
                   'pop_2015': [305, 72, 80, np.nan]})

df['total energy consumption'] = df['energy_per_capita'] * df['pop_2015'].fillna(df['pop_2014'])
print(df)
output
  country  energy_per_capita  pop_2014  pop_2015  total energy consumption
0      US                 10       300     305.0                    3050.0
1      FR                  8        70      72.0                     576.0
2      DE                  9        80      80.0                     720.0
3      SP                  7        60       NaN                     420.0
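An equivalent approach without fillna, sketched with numpy.where, picks pop_2015 where it is present and pop_2014 otherwise:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'country': ['US', 'FR', 'DE', 'SP'],
                   'energy_per_capita': [10, 8, 9, 7],
                   'pop_2014': [300, 70, 80, 60],
                   'pop_2015': [305, 72, 80, np.nan]})

# choose pop_2015 where it exists, fall back to pop_2014
pop = np.where(df['pop_2015'].notna(), df['pop_2015'], df['pop_2014'])
df['total energy consumption'] = df['energy_per_capita'] * pop
print(df)
```

np.where generalizes more easily than fillna when the fallback condition is something other than missingness (for example, a threshold on another column).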
