I have a simplified data frame called df
import pandas as pd
df = pd.DataFrame({'num': [1,1,2,2],
'price': [12,11,15,13],
'y': [7,7,9,9]})
I want to group by num and then multiply price by y, and take the sum of that product divided by the sum of y
I've been trying to get started with this and have been having trouble:
df.groupby('letter').agg(['price']*['quantity'])
Prior to the groupby operation, you can add a temporary column to the dataframe that calculates your intermediate result (price * y), then use this column in the groupby: sum the values, then use eval to divide the sum of temp by the sum of y. Cast the result back to a dataframe and name the new column whatever you'd like.
>>> (df
...     .assign(temp=df.eval('price * y'))
...     .groupby('num')
...     .sum()
...     .eval('temp / y')
...     .to_frame('result')
... )
result
num
1 11.5
2 14.0
Basically you want to compute a weighted mean. One way to do this is:
import numpy as np
# define a custom aggregation function that uses the 'y' column as weights
weights = lambda x: np.average(x, weights=df.loc[x.index, 'y'])
# aggregate using this new function
df.groupby('num').agg({'price': weights})
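As a quick check, running this custom aggregation on the question's data reproduces the same values as the eval-based approach above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'num': [1, 1, 2, 2],
                   'price': [12, 11, 15, 13],
                   'y': [7, 7, 9, 9]})

# weighted mean of price within each group, using y as the weights
weights = lambda x: np.average(x, weights=df.loc[x.index, 'y'])
out = df.groupby('num').agg({'price': weights})
print(out)  # price is 11.5 for num=1 and 14.0 for num=2
```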
Not being very experienced with Python programming, I have to trigger a Python function in my ETL.
Goal: Run a RATE function on every row of a table containing financial data.
Here is the function I have to use:
import numpy as np
Solution = np.rate(nper, pmt, pv, fv, when=1, guess=0.01, tol=1e-06, maxiter=100)
and as argument, I'd like to use data coming from a table stored in a Pandas DataFrame
Here I'm presenting only one row for example purposes, but my table will most likely contain many rows.
Table:

nper   pmt      pv     fv
  56   281  -22057   9365
Does anyone know how to structure the function?
Thanks for reading
np.rate accepts array-like inputs, so you can simply pass the columns as arguments and receive an array of rates:
df = pd.DataFrame.from_dict({'nper': {0: 56}, 'pmt': {0: 281}, 'pv': {0: -22057}, 'fv': {0: 9365}})
np.rate(df.nper, df.pmt, df.pv, df.fv, when=1, guess=0.01, tol=1e-06, maxiter=100)
This uses the values nper, pmt, pv, fv of each row in the dataframe df in a vectorized manner. If df has n rows, the function returns an array of length n with the rate of each row.
To store the resulting rates in the dataframe df, you can assign them to a new column:
df['rate'] = np.rate(df.nper, df.pmt, df.pv, df.fv, when=1, guess=0.01, tol=1e-06, maxiter=100)
Note that np.rate is deprecated, so you should consider using the corresponding function in the numpy-financial library, https://pypi.org/project/numpy-financial.
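For intuition, np.rate numerically finds the rate that zeroes the standard annuity equation. As a rough illustration only (this is not np.rate's actual implementation, and the helper names here are made up), a Newton iteration with a numeric derivative might look like:

```python
import numpy as np

def annuity_residual(rate, nper, pmt, pv, fv, when=1):
    # The equation whose root defines the rate:
    # fv + pv*(1+rate)**nper + pmt*(1+rate*when)/rate * ((1+rate)**nper - 1) = 0
    t = (1.0 + rate) ** nper
    return fv + pv * t + pmt * (1.0 + rate * when) / rate * (t - 1.0)

def solve_rate(nper, pmt, pv, fv, when=1, guess=0.01, tol=1e-6, maxiter=100):
    # Newton's method with a numeric derivative
    rate = guess
    for _ in range(maxiter):
        f = annuity_residual(rate, nper, pmt, pv, fv, when)
        h = 1e-7
        deriv = (annuity_residual(rate + h, nper, pmt, pv, fv, when) - f) / h
        step = f / deriv
        rate -= step
        if abs(step) < tol:
            break
    return rate

rate = solve_rate(56, 281, -22057, 9365)
print(rate)  # a small positive rate for the question's row
```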
Another option is to create a separate function wrapping the rate calculation. You can then use df.apply with axis=1, which passes every row to this function; inside it, you pull the values out of the row and feed them to np.rate. At the end you get a new column called Solution containing what the rate function computed:
import pandas as pd
import numpy as np
data = {
    "nper": [1, 2, 3],
    "pmt": [4, 5, 6],
    "pv": [7, 8, 9],
    "fv": [0, 1, 2]
}
df = pd.DataFrame(data)
def RunRateFunction(row):
    x = np.rate(row.nper, row.pmt, row.pv, row.fv, when=1, guess=0.01, tol=1e-06, maxiter=100)
    return x
df["Solution"] = df.apply(RunRateFunction, axis = 1)
Output:
nper pmt pv fv Solution
0 1 4 7 0 -1.000000
1 2 5 8 1 NaN
2 3 6 9 2 -1.348887
market['AAPL'] is a dataframe with Apple's daily stock return
I noticed that:
market['AAPL'].apply(lambda x: np.exp(x))
market['AAPL'].apply(lambda x: np.cumprod(np.exp(x)))
Both give the same results
Why is the np.cumprod not working?
You probably mean to apply the cumulative product across the AAPL column. Your current attempt doesn't work because .apply operates element-wise here: np.cumprod is called each time on a single number, not on an array of numbers.
Instead, try something like this:
import pandas as pd
import numpy as np
aapl = {"AAPL": np.linspace(1, 2, 10)}
df = pd.DataFrame(aapl)
# Calculate exp for the column, then calculate
# the cumulative product over the column
df['cum-AAPL'] = np.exp(df['AAPL']).cumprod()
Because x is a single number, np.exp(x) is a single number, and the cumulative product of one number is just that number.
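A small demonstration of the contrast (made-up numbers; `.item()` unwraps the 1-element array np.cumprod returns for a scalar):

```python
import numpy as np
import pandas as pd

s = pd.Series([0.1, 0.2, 0.3])

# Element-wise apply: np.cumprod sees one number per call,
# so there is nothing to accumulate
per_element = s.apply(lambda x: np.cumprod(np.exp(x)).item())
print(per_element.tolist())  # identical to np.exp(s)

# Column-wise: exponentiate the whole column, then accumulate down it
column_wise = np.exp(s).cumprod()
print(column_wise.tolist())
```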
Objective: group a pandas dataframe using a custom WMAPE (Weighted Mean Absolute Percent Error) function on multiple forecast columns and one actual-data column, without a for-loop. I know a for-loop and merges of the output dataframes will do the trick; I want to do this efficiently.
Have: WMAPE function, successful use of WMAPE function on one forecast column of dataframe. One column of actual data, variable number of forecast columns.
Input Data: Pandas DataFrame with several categorical columns (City, Person, DT, HOUR), one actual data column (Actual), and four forecast columns (Forecast_1 ... Forecast_4). See link for csv:
https://www.dropbox.com/s/tidf9lj80a1dtd8/data_small_2.csv?dl=1
Need: WMAPE function applied during groupby on multiple columns with a list of forecast columns fed into groupby line.
Output Desired: An output dataframe with categorical groups columns and all columns of WMAPE. Labeling is preferred but not needed (output image below).
Successful Code so far:
Two WMAPE functions: one to take two series in & output a single float value (wmape), and one structured for use in a groupby (wmape_gr):
def wmape(actual, forecast):
    # take two series and calculate a wmape from them
    # make a series called mape
    se_mape = abs(actual - forecast) / actual
    # get a float of the sum of the actual
    ft_actual_sum = actual.sum()
    # get a series of the product of the actual & the mape
    se_actual_prod_mape = actual * se_mape
    # sum the product of the actual and the mape
    ft_actual_prod_mape_sum = se_actual_prod_mape.sum()
    # float: wmape of forecast
    ft_wmape_forecast = ft_actual_prod_mape_sum / ft_actual_sum
    # return a float
    return ft_wmape_forecast
def wmape_gr(df_in, st_actual, st_forecast):
    # take a dataframe and two column names and calculate a wmape
    # make a series called mape
    se_mape = abs(df_in[st_actual] - df_in[st_forecast]) / df_in[st_actual]
    # get a float of the sum of the actual
    ft_actual_sum = df_in[st_actual].sum()
    # get a series of the product of the actual & the mape
    se_actual_prod_mape = df_in[st_actual] * se_mape
    # sum the product of the actual and the mape
    ft_actual_prod_mape_sum = se_actual_prod_mape.sum()
    # float: wmape of forecast
    ft_wmape_forecast = ft_actual_prod_mape_sum / ft_actual_sum
    # return a float
    return ft_wmape_forecast
# read in data directly from Dropbox
df = pd.read_csv('https://www.dropbox.com/s/tidf9lj80a1dtd8/data_small_2.csv?dl=1',sep=",",header=0)
# grouping with 3 columns. wmape_gr uses the Actual column, and Forecast_1 as inputs
df_gr = df.groupby(['City','Person','DT']).apply(wmape_gr,'Actual','Forecast_1')
Output Looks Like (first two rows):
Desired output would have all forecasts in one shot (dummy data for Forecast_2 ... Forecast_4). I can already do this with a for-loop. I just want to do it within the groupby. I want to call a wmape function four times. I would appreciate any assistance.
This is a really good problem to show how to optimize a groupby.apply in pandas. There are two principles that I use to help with these problems.
1. Any calculation that is independent of the group should not be done within a groupby
2. If there is a built-in groupby method, use it first before using apply
Let's go line by line through your wmape_gr function.
se_mape = abs(df_in[st_actual] - df_in[st_forecast]) / df_in[st_actual]
This line is completely independent of any group. You should do this calculation outside of the apply. Below I do this for each of the forecast columns:
df['actual_forecast_diff_1'] = (df['Actual'] - df['Forecast_1']).abs() / df['Actual']
df['actual_forecast_diff_2'] = (df['Actual'] - df['Forecast_2']).abs() / df['Actual']
df['actual_forecast_diff_3'] = (df['Actual'] - df['Forecast_3']).abs() / df['Actual']
df['actual_forecast_diff_4'] = (df['Actual'] - df['Forecast_4']).abs() / df['Actual']
Let's take a look at the next line:
ft_actual_sum = df_in[st_actual].sum()
This line is dependent on the group so we must use a groupby here, but it isn't necessary to place this within the apply function. It will be calculated later on below.
Let's move to the next line:
se_actual_prod_mape = df_in[st_actual] * se_mape
This again is independent of the group. Let's calculate it on the DataFrame as a whole.
df['forecast1_wampe'] = df['actual_forecast_diff_1'] * df['Actual']
df['forecast2_wampe'] = df['actual_forecast_diff_2'] * df['Actual']
df['forecast3_wampe'] = df['actual_forecast_diff_3'] * df['Actual']
df['forecast4_wampe'] = df['actual_forecast_diff_4'] * df['Actual']
Let's move on to the last two lines:
ft_actual_prod_mape_sum = se_actual_prod_mape.sum()
ft_wmape_forecast = ft_actual_prod_mape_sum / ft_actual_sum
These lines again are dependent on the group, but we still don't need to use apply. We now have each of the 4 'forecast_wampe' columns calculated independently of the group. We simply need to sum each one per group. The same goes for the 'Actual' column.
We can run two separate groupby operations to sum each of these columns like this:
g = df.groupby(['City', 'Person', 'DT'])
actual_sum = g['Actual'].sum()
forecast_wampe_cols = ['forecast1_wampe', 'forecast2_wampe', 'forecast3_wampe', 'forecast4_wampe']
forecast1_wampe_sum = g[forecast_wampe_cols].sum()
We get the following Series and DataFrame returned
Then we just need to divide each of the columns in the DataFrame by the Series. We'll need to use the div method to change the orientation of the division so that the indexes align:
forecast1_wampe_sum.div(actual_sum, axis='index')
And this returns our answer:
If you modify wmape to work with arrays using broadcasting, then you can do it in one shot:
def wmape(actual, forecast):
    # Take a series (actual) and a dataframe (forecast) and calculate wmape
    # for each forecast. Output shape is (1, num_forecasts)
    # Convert to numpy arrays for broadcasting
    forecast = np.array(forecast.values)
    actual = np.array(actual.values).reshape((-1, 1))
    # Make an array of mape (same shape as forecast)
    se_mape = abs(actual - forecast) / actual
    # Calculate sum of actual values
    ft_actual_sum = actual.sum(axis=0)
    # Multiply the actual values by the mape
    se_actual_prod_mape = actual * se_mape
    # Take the sum of the product of actual values and mape
    # Make sure to sum down the rows (1 for each column)
    ft_actual_prod_mape_sum = se_actual_prod_mape.sum(axis=0)
    # Calculate the wmape for each forecast and return as a dictionary
    ft_wmape_forecast = ft_actual_prod_mape_sum / ft_actual_sum
    return {f'Forecast_{i+1}_wmape': wmape for i, wmape in enumerate(ft_wmape_forecast)}
Then use apply on the proper columns:
# Group the dataframe and apply the function to appropriate columns
new_df = df.groupby(['City', 'Person', 'DT']).apply(
    lambda x: wmape(x['Actual'], x[[c for c in x if 'Forecast' in c]])
).to_frame().reset_index()
This results in a dataframe with a single dictionary column.
The single column can be converted to multiple columns for the correct format:
# Convert the dictionary in a single column into 4 columns with proper names
# and concantenate column-wise
df_grp = pd.concat([new_df.drop(columns=[0]),
                    pd.DataFrame(list(new_df[0].values))], axis=1)
Result:
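As a sanity check on the broadcasting idea, a small synthetic frame (made-up numbers, two forecast columns instead of four) lets us compare the one-shot result against a per-column computation:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({'Actual': rng.uniform(10, 20, 6),
                   'Forecast_1': rng.uniform(10, 20, 6),
                   'Forecast_2': rng.uniform(10, 20, 6)})

actual = df['Actual'].to_numpy().reshape(-1, 1)          # shape (6, 1)
forecasts = df[['Forecast_1', 'Forecast_2']].to_numpy()  # shape (6, 2)

# One pass over both forecast columns at once via broadcasting
broadcast = (actual * (np.abs(actual - forecasts) / actual)).sum(axis=0) / actual.sum()

# Per-column check: the weighting cancels, so wmape reduces to
# sum(|actual - forecast|) / sum(actual)
for j, col in enumerate(['Forecast_1', 'Forecast_2']):
    single = np.abs(df['Actual'] - df[col]).sum() / df['Actual'].sum()
    assert np.isclose(broadcast[j], single)
print(broadcast)
```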
Without changing the functions, apply them four times:
df_gr1 = df.groupby(['City','Person','DT']).apply(wmape_gr,'Actual','Forecast_1')
df_gr2 = df.groupby(['City','Person','DT']).apply(wmape_gr,'Actual','Forecast_2')
df_gr3 = df.groupby(['City','Person','DT']).apply(wmape_gr,'Actual','Forecast_3')
df_gr4 = df.groupby(['City','Person','DT']).apply(wmape_gr,'Actual','Forecast_4')
Join them together:
all1= pd.concat([df_gr1, df_gr2,df_gr3,df_gr4],axis=1, sort=False)
Get the columns for City, Person and DT:
all1['city']= [all1.index[i][0] for i in range(len(df_gr1))]
all1['Person']= [all1.index[i][1] for i in range(len(df_gr1))]
all1['DT']= [all1.index[i][2] for i in range(len(df_gr1))]
Rename the columns and change the order:
df = all1.rename(columns={0:'Forecast_1_wmape', 1:'Forecast_2_wmape',2:'Forecast_3_wmape',3:'Forecast_4_wmape'})
df = df[['city','Person','DT','Forecast_1_wmape','Forecast_2_wmape','Forecast_3_wmape','Forecast_4_wmape']]
df=df.reset_index(drop=True)
I have a DataFrame with approx. 4 columns and 200 rows. I created a 5th column with null values:
df['minutes'] = np.nan
Then, I want to fill each row of this new column with random inverse log normal values. The code to generate 1 inverse log normal:
Note: if the code below is run multiple times, it will generate a new result because of the value inside ppf(): random.random()
df['minutes'] = df['minutes'].fillna(stats.lognorm(0.5, scale=np.exp(1.8)).ppf(random.random()).astype(int))
What's happening when I do that is that it's filling all 200 rows of df['minutes'] with the same number, instead of triggering the random.random() for each row as I expected it to.
What do I have to do? I tried using a for loop but apparently I'm not getting it right (it gives the same results):
for i in range(1, len(df)):
    df['minutes'] = df['minutes'].fillna(stats.lognorm(0.5, scale=np.exp(1.8)).ppf(random.random()).astype(int))
what am I doing wrong?
Also, I'll add that later I'll need to change some parameters of the inverse log normal above if the value of another column is 0 or 1. as in:
if df['type'] == 0:
    df['minutes'] = df['minutes'].fillna(stats.lognorm(0.5, scale=np.exp(1.8)).ppf(random.random()).astype(int))
elif df['type'] == 1:
    df['minutes'] = df['minutes'].fillna(stats.lognorm(1.2, scale=np.exp(2.7)).ppf(random.random()).astype(int))
thanks in advance.
The problem with your use of fillna here is that this function takes a value as argument and applies it to every element along the specified axis. So your stat value is calculated once and then distributed into every row.
What you need is your function called for every element on the axis, so your argument must be the function itself and not a value. That's a job for apply, which takes a function and applies it on elements along an axis.
I'm straight jumping to your final requirements:
You could use apply on the minutes column alone (as a pandas.Series method) with a lambda function, and then assign the respective results to the rows of minutes selected by filtering on the type column:
import numpy as np
import pandas as pd
import scipy.stats as stats
import random
# setup
df = pd.DataFrame(np.random.randint(0, 2, size=(8, 4)),
columns=list('ABC') + ['type'])
df['minutes'] = np.nan
df.loc[df.type == 0, 'minutes'] = \
    df['minutes'].apply(
        lambda _: stats.lognorm(0.5, scale=np.exp(1.8)).ppf(random.random()).astype(int),
        convert_dtype=False)
df.loc[df.type == 1, 'minutes'] = \
    df['minutes'].apply(
        lambda _: stats.lognorm(1.2, scale=np.exp(2.7)).ppf(random.random()).astype(int),
        convert_dtype=False)
... or you use apply as a DataFrame method with a function wrapping your logic to distinguish between values of type-column and assign the result back to the minutes-column:
def calc_minutes(row):
    if row['type'] == 0:
        return stats.lognorm(0.5, scale=np.exp(1.8)).ppf(random.random()).astype(int)
    elif row['type'] == 1:
        return stats.lognorm(1.2, scale=np.exp(2.7)).ppf(random.random()).astype(int)
df['minutes'] = df.apply(calc_minutes, axis=1)
Managed to do it in a few steps with a different mindset: I created 2 lists, each with its own parameters, and appended to them in a loop so that each row gets a different random number:
lognormal_tone = []
lognormal_ttwo = []
for i in range(len(s)):
    lognormal_tone.append(stats.lognorm(0.5, scale=np.exp(1.8)).ppf(random.random()).astype(int))
    lognormal_ttwo.append(stats.lognorm(0.4, scale=np.exp(2.7)).ppf(random.random()).astype(int))
Then, included them in the DataFrame with another previously created list:
df = pd.DataFrame({'arrival': arrival, 'minTypeOne': lognormal_tone, 'minTypeTwo': lognormal_ttwo})
I want to create a new column comp in a pandas dataframe containing a single column price. The value of this new column should be generated by a function that works on the current and last 3 values of the price.
df.apply() works off a single row, and shift() doesn't seem to work. Do experts have any suggestions for making this a vectorized operation?
Use apply with a custom series-sum function. The code below assumes you have an index or column named ID of increasing row values 1, 2, 3, ... that can be used to count back 3 values.
# SERIES SUM FUNCTION
def intsum(x):
    if x < 3:
        ser = df.price[(df.ID < x)]
    else:
        ser = df.price[(df.ID >= x - 3) & (df.ID < x)]
    return ser.sum()
# APPLY FUNCTION
df['comp'] = df['ID'].apply(intsum)
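For larger frames, the same rolling sum can usually be had without a Python-level function call per row. A sketch using shift and rolling (assuming, as above, that rows are already ordered by ID; the data here is made up):

```python
import pandas as pd

# Hypothetical frame with the ID layout the answer assumes
df = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
                   'price': [10, 20, 30, 40, 50]})

# shift(1) excludes the current row, rolling(3) sums the previous 3 rows;
# fillna(0) matches intsum's empty-series sum for the first row
df['comp'] = df['price'].shift(1).rolling(3, min_periods=1).sum().fillna(0)
print(df['comp'].tolist())  # [0.0, 10.0, 30.0, 60.0, 90.0]
```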