Calculate weighted average of a data subset with one column as weights - python

I have a huge data set and I need to calculate a weighted average for subsets. My approach works but is painfully slow (dozens of minutes on my data) and I would like to find a faster solution. I thought of using DataFrame.agg, but I cannot figure out how to pass a second column as the weight.
So far I build an aggregated DataFrame with .apply, see below:
def weighted(df):
    grouped = df.groupby('spot')
    result = pd.DataFrame()
    for column in df.columns:
        result[column] = grouped.apply(weighted_average, column, 'weight')
    return result

def weighted_average(df, target_column, weight_column):
    """The average of a target column is calculated with a second column as weight.
    Input:
        DataFrame
        string for column to average
        string for column as weight
    Output:
        float, weighted average of target column"""
    return sum(df[weight_column] * df[target_column]) / df[weight_column].sum()
Thanks for your attention & help.
Update: I gained a factor of 100 (timed on 1/250 of my data) by doing the following:
def tot_weighting(df, weight_column, list_of_columns):
    norm = df.groupby('spot')[weight_column].transform('sum')
    for column in list_of_columns:
        df[column] = df[weight_column] * df[column] / norm
    result = df.groupby('spot').agg('sum')
    return result
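For illustration, here is a minimal sketch of how the updated function could be called on toy data (the 'value' and 'weight' numbers below are made up), cross-checked against numpy.average:

import numpy as np
import pandas as pd

# Hypothetical toy data: two 'spot' groups, one value column, one weight column
toy = pd.DataFrame({
    'spot':   [1, 1, 1, 2, 2, 2],
    'value':  [10.0, 20.0, 30.0, 1.0, 2.0, 3.0],
    'weight': [1.0, 1.0, 2.0, 1.0, 2.0, 3.0],
})

# Pass a copy, since tot_weighting overwrites the value columns in place
result = tot_weighting(toy.copy(), 'weight', ['value'])
print(result['value'])

# Cross-check against numpy's weighted average per group
check = toy.groupby('spot').apply(lambda g: np.average(g['value'], weights=g['weight']))
print(check)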

Related

Filter rows of one dataframe based on string similarity with another dataframe

I have two data frames of different lengths, let's say A (which has more rows) and B. In both data frames one column, 'col1', is the same and contains string values. The goal is to use the Levenshtein distance, and if the similarity is above a certain threshold I have to take the rows from A and create a new data frame.
Shape of A is 31K x 4
Shape of B is 5K x 9
This is the code I am using to create the new dataframe:
import pandas as pd
import textdistance as td

B_col1_unique_values = list(B['col1'].unique())
new_A_data = []

def compare_vnum(x):
    for idx, vnum in enumerate(B_col1_unique_values):
        if (td.levenshtein.normalized_similarity(str(vnum), str(x)) > 0.90) and (td.jaro_winkler(str(vnum), str(x)) > 0.95):
            B_code = set(B.loc[B['col1'] == vnum, 'Code'].tolist())
            A_data = A.loc[(A['Code'].isin(B_code)) & (A['col1'] == x)]
            new_A_data.extend(A_data.values.tolist())

_ = pd.Series(A['col1'].unique()).apply(compare_vnum)
Is there an efficient way to reduce the execution time?
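For what it's worth, one way to cut the per-iteration pandas overhead (the repeated B.loc and A.loc slicing) is to precompute B's codes per value once and filter A in a single pass at the end; the string comparisons themselves still dominate the runtime. A rough sketch under the same column names as above:

import pandas as pd
import textdistance as td

# Map each unique value of B['col1'] to its set of codes, once
b_codes = B.groupby('col1')['Code'].agg(set).to_dict()

def matching_codes(x):
    """Union of the codes of all B values similar enough to x."""
    x = str(x)
    out = set()
    for vnum, codes in b_codes.items():
        if (td.levenshtein.normalized_similarity(str(vnum), x) > 0.90
                and td.jaro_winkler(str(vnum), x) > 0.95):
            out |= codes
    return out

# Compare each unique A value once, then filter A in a single pass
codes_per_value = {x: matching_codes(x) for x in A['col1'].unique()}
mask = A.apply(lambda row: row['Code'] in codes_per_value[row['col1']], axis=1)
new_A = A[mask]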

Pythonic Way of Reducing Factor Levels in Large Dataframe

I am attempting to reduce the number of factor levels within a column in a pandas dataframe such that any factor whose instances make up a proportion of all column rows lower than a defined threshold (default set to 1%) will be bucketed into a new factor labeled 'Other'. Below is the function I am using to accomplish this task:
def condenseMe(df, column_name, threshold=0.01, newLabel="Other"):
    valDict = dict(df[column_name].value_counts() / len(df[column_name]))
    toCondense = [v for v in valDict.keys() if valDict[v] < threshold]
    if 'Missing' in toCondense:
        toCondense.remove('Missing')
    df[column_name] = df[column_name].apply(lambda x: newLabel if x in toCondense else x)
The issue I am running into is that I am working with a large dataset (~18 million rows) and am attempting to use this function on a column with more than 10,000 levels. Because of this, executing the function on this column takes a very long time to complete. Is there a more pythonic way to reduce the number of factor levels that will execute faster? Any help would be much appreciated!
You can do this with a combination of groupby, transform, and count:
def condenseMe(df, col, threshold=0.01, newLabel="Other"):
    # Create a new Series with the normalized value counts
    counts = df[[col]].groupby(col)[col].transform('count') / len(df)
    # Create a 1D mask based on threshold (ignoring "Missing")
    mask = (counts < threshold) & (df[col] != 'Missing')
    # Assign these masked values a new label (.loc avoids chained assignment)
    df.loc[mask, col] = newLabel
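For comparison, a sketch of the same idea built on value_counts(normalize=True) and isin, which also avoids a row-wise apply (the 'Missing' level is again left untouched):

def condenseMe_vc(df, col, threshold=0.01, newLabel="Other"):
    # Share of rows taken by each level
    freq = df[col].value_counts(normalize=True)
    # Levels below the threshold, keeping 'Missing' as-is
    rare = freq[freq < threshold].index.difference(['Missing'])
    df.loc[df[col].isin(rare), col] = newLabel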

Pandas: custom WMAPE function aggregation function to multiple columns without for-loop?

Objective: group a pandas dataframe using a custom WMAPE (Weighted Mean Absolute Percent Error) function on multiple forecast columns and one actual data column, without a for-loop. I know a for-loop and merges of output dataframes will do the trick; I want to do this efficiently.
Have: WMAPE function, successful use of WMAPE function on one forecast column of dataframe. One column of actual data, variable number of forecast columns.
Input Data: Pandas DataFrame with several categorical columns (City, Person, DT, HOUR), one actual data column (Actual), and four forecast columns (Forecast_1 ... Forecast_4). See link for csv:
https://www.dropbox.com/s/tidf9lj80a1dtd8/data_small_2.csv?dl=1
Need: WMAPE function applied during groupby on multiple columns with a list of forecast columns fed into groupby line.
Output Desired: An output dataframe with categorical groups columns and all columns of WMAPE. Labeling is preferred but not needed (output image below).
Successful Code so far:
Two WMAPE functions: one to take two series in & output a single float value (wmape), and one structured for use in a groupby (wmape_gr):
def wmape(actual, forecast):
    # take two series and calculate and output a wmape from them
    # make a series called mape
    se_mape = abs(actual - forecast) / actual
    # get a float of the sum of the actual
    ft_actual_sum = actual.sum()
    # get a series of the product of the actual & the mape
    se_actual_prod_mape = actual * se_mape
    # sum the product of the actual and the mape
    ft_actual_prod_mape_sum = se_actual_prod_mape.sum()
    # float: wmape of forecast
    ft_wmape_forecast = ft_actual_prod_mape_sum / ft_actual_sum
    # return a float
    return ft_wmape_forecast

def wmape_gr(df_in, st_actual, st_forecast):
    # take a dataframe and two column names and calculate and output a wmape
    # make a series called mape
    se_mape = abs(df_in[st_actual] - df_in[st_forecast]) / df_in[st_actual]
    # get a float of the sum of the actual
    ft_actual_sum = df_in[st_actual].sum()
    # get a series of the product of the actual & the mape
    se_actual_prod_mape = df_in[st_actual] * se_mape
    # sum the product of the actual and the mape
    ft_actual_prod_mape_sum = se_actual_prod_mape.sum()
    # float: wmape of forecast
    ft_wmape_forecast = ft_actual_prod_mape_sum / ft_actual_sum
    # return a float
    return ft_wmape_forecast
# read in data directly from Dropbox
df = pd.read_csv('https://www.dropbox.com/s/tidf9lj80a1dtd8/data_small_2.csv?dl=1',sep=",",header=0)
# grouping with 3 columns. wmape_gr uses the Actual column, and Forecast_1 as inputs
df_gr = df.groupby(['City','Person','DT']).apply(wmape_gr,'Actual','Forecast_1')
Output Looks Like (first two rows):
The desired output would have all forecasts in one shot (dummy data shown for Forecast_2 ... Forecast_4). I can already do this with a for-loop; I just want to do it within the groupby rather than calling a wmape function four separate times. I would appreciate any assistance.
This is a really good problem to show how to optimize a groupby.apply in pandas. There are two principles that I use to help with these problems.
Any calculation that is independent of the group should not be done within a groupby.
If there is a built-in groupby method, use it first before using apply.
Let's go line by line through your wmape_gr function.
se_mape = abs(df_in[st_actual] - df_in[st_forecast]) / df_in[st_actual]
This line is completely independent of any group. You should do this calculation outside of the apply. Below I do this for each of the forecast columns:
df['actual_forecast_diff_1'] = (df['Actual'] - df['Forecast_1']).abs() / df['Actual']
df['actual_forecast_diff_2'] = (df['Actual'] - df['Forecast_2']).abs() / df['Actual']
df['actual_forecast_diff_3'] = (df['Actual'] - df['Forecast_3']).abs() / df['Actual']
df['actual_forecast_diff_4'] = (df['Actual'] - df['Forecast_4']).abs() / df['Actual']
Let's take a look at the next line:
ft_actual_sum = df_in[st_actual].sum()
This line is dependent on the group so we must use a groupby here, but it isn't necessary to place this within the apply function. It will be calculated later on below.
Let's move to the next line:
se_actual_prod_mape = df_in[st_actual] * se_mape
This again is independent of the group. Let's calculate it on the DataFrame as a whole.
df['forecast1_wmape'] = df['actual_forecast_diff_1'] * df['Actual']
df['forecast2_wmape'] = df['actual_forecast_diff_2'] * df['Actual']
df['forecast3_wmape'] = df['actual_forecast_diff_3'] * df['Actual']
df['forecast4_wmape'] = df['actual_forecast_diff_4'] * df['Actual']
Let's move on to the last two lines:
ft_actual_prod_mape_sum = se_actual_prod_mape.sum()
ft_wmape_forecast = ft_actual_prod_mape_sum / ft_actual_sum
These lines again are dependent on the group, but we still don't need to use apply. We now have each of the four 'forecast_wmape' columns calculated independently of the group. We simply need to sum each one per group. The same goes for the 'Actual' column.
We can run two separate groupby operations to sum each of these columns like this:
g = df.groupby(['City', 'Person', 'DT'])
actual_sum = g['Actual'].sum()
forecast_wmape_cols = ['forecast1_wmape', 'forecast2_wmape', 'forecast3_wmape', 'forecast4_wmape']
forecast_wmape_sum = g[forecast_wmape_cols].sum()
We get the following Series and DataFrame returned
Then we just need to divide each of the columns in the DataFrame by the Series. We'll need to use the div method to change the orientation of the division so that the indexes align
forecast_wmape_sum.div(actual_sum, axis='index')
And this returns our answer:
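As an aside, since actual * (|actual - forecast| / actual) simplifies to |actual - forecast|, the per-group sums above reduce to the sum of absolute errors divided by the sum of actuals. A compact sketch of the whole computation (column names taken from the question) might look like:

def wmape_by_group(df, group_cols, actual_col, forecast_cols):
    # WMAPE per group for each forecast column: sum(|actual - forecast|) / sum(actual)
    abs_err = df[forecast_cols].sub(df[actual_col], axis=0).abs()
    grouped = pd.concat([df[group_cols + [actual_col]], abs_err], axis=1).groupby(group_cols)
    return grouped[forecast_cols].sum().div(grouped[actual_col].sum(), axis='index')

# e.g. wmape_by_group(df, ['City', 'Person', 'DT'], 'Actual',
#                     ['Forecast_1', 'Forecast_2', 'Forecast_3', 'Forecast_4'])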
If you modify wmape to work with arrays using broadcasting, then you can do it in one shot:
def wmape(actual, forecast):
    # Take a series (actual) and a dataframe (forecast) and calculate the wmape
    # for each forecast column; one value is returned per forecast
    # Convert to numpy arrays for broadcasting
    forecast = np.array(forecast.values)
    actual = np.array(actual.values).reshape((-1, 1))
    # Make an array of mape (same shape as forecast)
    se_mape = abs(actual - forecast) / actual
    # Calculate the sum of the actual values
    ft_actual_sum = actual.sum(axis=0)
    # Multiply the actual values by the mape
    se_actual_prod_mape = actual * se_mape
    # Take the sum of the product of actual values and mape
    # Make sure to sum down the rows (1 for each column)
    ft_actual_prod_mape_sum = se_actual_prod_mape.sum(axis=0)
    # Calculate the wmape for each forecast and return as a dictionary
    ft_wmape_forecast = ft_actual_prod_mape_sum / ft_actual_sum
    return {f'Forecast_{i+1}_wmape': val for i, val in enumerate(ft_wmape_forecast)}
Then use apply on the proper columns:
# Group the dataframe and apply the function to the appropriate columns
new_df = df.groupby(['City', 'Person', 'DT']).apply(
    lambda x: wmape(x['Actual'], x[[c for c in x if 'Forecast' in c]])
).to_frame().reset_index()
This results in a dataframe with a single dictionary column.
The single column can be converted to multiple columns for the correct format:
# Convert the dictionary in the single column into 4 columns with proper names
# and concatenate column-wise
df_grp = pd.concat([new_df.drop(columns=[0]),
                    pd.DataFrame(list(new_df[0].values))], axis=1)
Result:
Without changing the functions, apply wmape_gr four times:
df_gr1 = df.groupby(['City','Person','DT']).apply(wmape_gr,'Actual','Forecast_1')
df_gr2 = df.groupby(['City','Person','DT']).apply(wmape_gr,'Actual','Forecast_2')
df_gr3 = df.groupby(['City','Person','DT']).apply(wmape_gr,'Actual','Forecast_3')
df_gr4 = df.groupby(['City','Person','DT']).apply(wmape_gr,'Actual','Forecast_4')
Join the results together:
all1= pd.concat([df_gr1, df_gr2,df_gr3,df_gr4],axis=1, sort=False)
Get the city, Person and DT columns back from the index:
all1['city']= [all1.index[i][0] for i in range(len(df_gr1))]
all1['Person']= [all1.index[i][1] for i in range(len(df_gr1))]
all1['DT']= [all1.index[i][2] for i in range(len(df_gr1))]
Rename the columns and change the order:
df = all1.rename(columns={0:'Forecast_1_wmape', 1:'Forecast_2_wmape',2:'Forecast_3_wmape',3:'Forecast_4_wmape'})
df = df[['city','Person','DT','Forecast_1_wmape','Forecast_2_wmape','Forecast_3_wmape','Forecast_4_wmape']]
df=df.reset_index(drop=True)

Generate numpy array with duplicate rate

Here is my problem: I have to generate some synthetic data (7 or 8 columns), correlated with each other (using the Pearson coefficient). I can do this easily, but next I have to insert a percentage of duplicates in each column (yes, the Pearson coefficient will be lower), different for each column.
The problem is that I don't want to insert those duplicates by hand, because in my case it would be like cheating.
Does anyone know how to generate correlated data that already contains duplicates? I've searched, but questions are usually about dropping or avoiding duplicates.
Language: Python 3
To generate correlated data I'm using this simple code: Generating correlated data
Try something like this:
import numpy as np

indices = np.random.randint(0, array.shape[0], size=int(np.ceil(percentage * array.shape[0])))
array = np.vstack([array, array[indices]])
Here I make the assumption that your data is stored in array, which is an ndarray where each row contains your 7/8 columns of data.
The code above draws an array of random row indices and stacks the corresponding rows back onto the array.
I found a solution.
I'll post the code; it might be helpful for someone.
# These are the data, generated randomly with a given shape
rnd = np.random.random(size=(10**7, 8))

# That array represents a column of the covariance matrix (I want correlated data,
# so I randomly choose a number between 0.8 and 0.95).
# I added 7 other columns, with varying ranges of values (all above 0.7)
attr1 = np.random.uniform(0.8, .95, size=(8, 1))
# attr2 ... attr8 are built like attr1

# corr_mat is the matrix, the union of the columns
corr_mat = np.column_stack((attr1, attr2, attr3, attr4, attr5, attr6, attr7, attr8))

from statsmodels.stats.correlation_tools import cov_nearest
# Using that function I find the nearest covariance matrix to my matrix,
# to be sure that it is positive definite
a = cov_nearest(corr_mat)

from scipy.linalg import cholesky
upper_chol = cholesky(a)

# Finally, compute the inner product of rnd and upper_chol
ans = rnd @ upper_chol
# ans now holds randomly correlated data (high correlation, but customizable)

# Next I create a pandas DataFrame with the ans values
df = pd.DataFrame(ans, columns=['att1', 'att2', 'att3', 'att4',
                                'att5', 'att6', 'att7', 'att8'])

# The last step is to truncate the float values of ans in a varying way,
# so I get duplicates in varying percentages
a = df.values
for i in range(8):
    trunc = np.random.randint(5, 12)
    print(trunc)
    a.T[i] = a.T[i].round(decimals=trunc)
# The float values of ans have 16 decimals, so I randomly choose an int
# between 5 and 12 and use it to truncate each value
Finally, these are my duplicate percentages for each column:
duplicate rate attribute: att1 = 5.159390000000002
duplicate rate attribute: att2 = 11.852260000000001
duplicate rate attribute: att3 = 12.036079999999998
duplicate rate attribute: att4 = 35.10611
duplicate rate attribute: att5 = 4.6471599999999995
duplicate rate attribute: att6 = 35.46553
duplicate rate attribute: att7 = 0.49115000000000464
duplicate rate attribute: att8 = 37.33252
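A way to measure per-column duplicate rates like these (not shown in the original post) could be Series.duplicated, which flags every repeat of an earlier value:

# Percentage of entries in each column that duplicate an earlier value
# (assumes df still holds the truncated values)
dup_rates = df.apply(lambda col: col.duplicated().mean() * 100)
print(dup_rates)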

Iterating over entire columns and storing the result into a list

I would like to know how I could iterate through each column of a dataframe to perform some calculations and store the result in another dataframe.
df_empty = []
m = daily.ix[:, -1]  # Columns = stocks & Rows = daily returns
stocks = daily.ix[:, :-1]
for col in range(len(stocks.columns)):
    s = daily.ix[:, col]
    covmat = np.cov(s, m)
    beta = covmat[0, 1] / covmat[1, 1]
    return (beta)
    print(beta)
In the above example, I first want to calculate a covariance matrix between "s" (the columns representing the stocks' daily returns, which I want to iterate through one by one) and "m" (the market daily return, which is my reference column and the last column of my dataframe). Then I want to calculate the beta for each stock/market covariance pair.
I'm not sure why return(beta) gives me a single numerical result for one stock while print(beta) prints the beta for all stocks.
I'd like to find a way to create a dataframe with all these betas.
beta_df = df_empty.append(beta)
I have tried the above code, but it returns None, as if it could not append the outcome.
Thank you for your help
The return statement within your for-loop ends the loop itself the first time the return is encountered. Moreover, you are not saving the beta value anywhere because the for-loop itself does not return a value in python (it only has side effects).
Apart from that, you may choose a more pandas-like approach using apply on the data frame, which iterates over the columns, passes each column to the supplied function as its first parameter, and collects the results of the function calls. Here is a minimal working example with some dummy data:
import pandas as pd
import numpy as np

# create some dummy data
daily = pd.DataFrame(np.random.randint(100, size=(100, 5)))
# define reference column
cov_column = daily.iloc[:, -1]

# setup computation function
def compute(column):
    covmat = np.cov(column, cov_column)
    return covmat[0, 1] / covmat[1, 1]

# use apply to iterate over columns
result = daily.iloc[:, :-1].apply(compute)
# show output
print(result)
0 -0.125382
1 0.024777
2 0.011324
3 -0.017622
dtype: float64
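As a further sketch (using the same dummy frame as above), the betas can also be computed without apply at all, since beta = cov(s, m) / var(m) and the normalization factors cancel when both are written as sums of centered products:

# Vectorized betas: sum((s - mean(s)) * (m - mean(m))) / sum((m - mean(m))**2)
m_centered = cov_column - cov_column.mean()
stocks = daily.iloc[:, :-1]
betas = stocks.sub(stocks.mean()).mul(m_centered, axis=0).sum() / (m_centered ** 2).sum()
print(betas)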
