How to obtain bootstrap estimates for multiple parameters - python

I want to create a function that uses the bootstrap to estimate the mean and median of multiple parameters. I have been able to do this for a single parameter, but for multiple parameters it seems I would need a loop that goes through each one and returns its mean and median. The code for the single-parameter case is:
import pandas as pd

def get_boostrap_estimates_single_param(sample_size):
    """
    Generate a bootstrap sample but hardcode df within the function
    so that we can use Pool.map() later
    :param sample_size: how many rows to sample from the input dataframe
    :returns mean and median of this particular bootstrap sample
    """
    # Load the age.csv file into a dataframe here
    dataPath = '/home/faustus/Documents/Review Courses/Big Data Analysis/age.csv'
    age = pd.read_csv(dataPath, index_col = False)
    # shuffle the dataframe by using pd.DataFrame.sample() with
    # n = df.size
    df = pd.DataFrame.sample(age, n = age.size)
    # Get bootstrap sample with replacement
    df_sample = df.sample(sample_size, replace = True)
    # return mean and median
    return df_sample.mean(), df_sample.median()
Output:
get_boostrap_estimates_single_param(2000)
(age 23.057
dtype: float64,
age 19.0
dtype: float64)
Now I want to do this for multiple parameters, but instead of hard-coding the dataframe, I want to pass it as an argument to the function:
def get_boostrap_estimates_multiple_params(df, sample_size):
    """
    Generate a bootstrap sample
    :param df: input dataframe
    :param sample_size: how many rows to sample from the input dataframe
    :returns mean and median of this particular bootstrap sample
    """
How can I solve this problem?
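One possible way to fill in the stub, as a minimal sketch rather than a definitive answer: `DataFrame.sample` with `replace=True` already draws the bootstrap rows for every column at once, and `mean()` and `median()` then return one value per column, so no explicit loop over the parameters is needed. The `random_state` argument and the toy `height` column below are assumptions added for illustration; they are not part of the original question.

```python
import pandas as pd


def get_boostrap_estimates_multiple_params(df, sample_size, random_state=None):
    """
    Draw one bootstrap sample and summarise every numeric column.
    random_state is a hypothetical extra argument, added for reproducibility.
    """
    # Sampling with replacement is the bootstrap step; a prior shuffle is not needed
    df_sample = df.sample(n=sample_size, replace=True, random_state=random_state)
    # mean() and median() work column-wise, giving one estimate per parameter
    return df_sample.mean(numeric_only=True), df_sample.median(numeric_only=True)


# Toy data standing in for age.csv (the second column is made up)
toy = pd.DataFrame({
    "age": [19, 23, 31, 45, 52],
    "height": [150.0, 160.0, 170.0, 180.0, 190.0],
})
means, medians = get_boostrap_estimates_multiple_params(toy, sample_size=2000, random_state=0)
print(means)
print(medians)
```

Note that `df.size` counts cells (rows times columns), so `len(df)` would be the safer choice if a shuffle over all rows is still wanted for a multi-column frame.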

Related

Ambiguous behaviour of pandas aggregate function

So I was using pandas for some analysis and ran into a scenario, in which I had to run 2 functions for different groups in my data. And I decided to use pandas' .agg function.
Below are my 2 functions:
import numpy as np
# acf/pacf are assumed to come from statsmodels, which matches the call signatures used here
from statsmodels.tsa.stattools import acf, pacf

def pick_pacf(df,alpha=0.05,nlags=192):
    '''
    This function returns the lags in the timeseries which are highly correlated with the original timeseries
    Input
    1. df: pandas series, this is the column for which we are trying to find AR lag
    2. alpha: float, confidence interval
    3. nlags: int, the no. of lags to be tested
    Return
    1. lags: list, this contains the list of all the lags (# of timestamps) that are highly correlated
    '''
    values,conf_int = pacf(df.values,alpha=alpha,nlags=nlags)
    lags = []
    #in the pacf function, confidence interval is centered around pacf values
    #we need them to be centered around 0, this will produce the intervals we see in the graph
    conf_int_cntrd = [value[0] - value[1] for value in zip(conf_int,values)]
    for obs_index, obs in enumerate(zip(conf_int_cntrd,values)):
        if obs[1] >= obs[0][1]: #obs[0][1] contains the high value of the conf int
            lags.append(obs_index)
        elif obs[1] <= obs[0][0]: #obs[0][0] contains the low value of the conf_int
            lags.append(obs_index)
    lags.remove(0) #removing the 0 lag for auto-corr with itself
    return lags

def pick_acf(df,nlags=192):
    '''
    This function returns the ACF value for a MA model for a time series
    Input
    1. df: pandas series, this is the series for which we want to find ACF value
    2. nlags: the number of lags to be taken into consideration for ACF
    Returns
    1. q: numpy array, The lags value at which ACF cuts off
    '''
    acf_values = acf(df.values)
    acf_values = np.round(acf_values,1)
    q = np.where(acf_values == 0)[0]
    return q
No need to go through the functions line by line (you can if you want to) but the main thing to focus here is what the two functions return. pick_pacf returns a list, whereas pick_acf returns a numpy array.
The calls to these functions are like this:
pacf_values = train_ads[['INVERTER_ID','PER_TS_YIELD']].groupby('INVERTER_ID').agg(lambda x: pick_pacf(x))
acf_values = train_ads[['INVERTER_ID','PER_TS_YIELD']].groupby('INVERTER_ID').agg(lambda x: pick_acf(x))
PER_TS_YIELD is a numeric column and INVERTER_ID is an alphanumeric column.
The ambiguous behaviour is that when I call pick_pacf, only the pandas Series of the PER_TS_YIELD column is sent as input to the function for each INVERTER_ID.
Whereas when I call pick_acf, first the pandas Series of the PER_TS_YIELD column is sent, and then the whole data frame made up of the INVERTER_ID and PER_TS_YIELD columns is sent to the function. This leads to an error, because some of my calculations fail when an alphanumeric column is received.
Why is this happening? Does the behaviour of the .agg function depend on what is returned from the user-defined function?
Can someone please explain this to me? Thanks in advance.
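Whatever the cause of the second call turns out to be, one way to make sure the function only ever receives the numeric column is to select that column from the groupby before applying the function. The sketch below uses made-up data and a stand-in function (`pick_zero_positions`, a hypothetical name) instead of the statsmodels-based functions above; it returns a plain list so that one object is stored per group.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for train_ads
train_ads = pd.DataFrame({
    "INVERTER_ID": ["A1", "A1", "A1", "B2", "B2", "B2"],
    "PER_TS_YIELD": [0.0, 1.5, 0.0, 2.0, 0.0, 3.0],
})


def pick_zero_positions(series):
    # Hypothetical stand-in for pick_acf: positions where the value is 0
    assert isinstance(series, pd.Series)  # a whole DataFrame never arrives here
    return np.where(series.values == 0)[0].tolist()


# Selecting the column first yields a SeriesGroupBy, so each call gets one Series
zero_positions = (
    train_ads.groupby("INVERTER_ID")["PER_TS_YIELD"].apply(pick_zero_positions)
)
print(zero_positions)
```

This does not explain the behaviour of `.agg` itself; it only removes the alphanumeric column from what the function can be handed.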

Splitting / Divide dataframe every 10 rows

I'm trying to split my data frame into chunks of 10 rows, compute aggregate functions (mean, SD, etc.) for each chunk, and then merge the results into one data frame again. I have already grouped the data using the .groupby function, but I am having trouble splitting the data into chunks of 10 rows.
This is what I have done :
def sorting(df):
    grouped = df.groupby(['Activity'])
    l_grouped = list(grouped)
    return l_grouped
I turned the grouped result into a list (l_grouped), but I don't know whether I can separate the rows within each tuple/list.
The result was identical to the original data frame, except that the rows were separated by 'Activity'. For example, the rows that have 'Standing' as the target value ('Activity') would be accessible by calling l_grouped[1][0] (type list/tuple). l_grouped[1][1] would return the word 'Standing' only.
I could access the grouped result using :
for i in range(len(df_sort)):
    print(df_sort[i][1])
df_sort refers to the result of calling sorting(df).
Is there any way I could split each tuple/list every 10 rows, and then compute the aggregate functions on each block?
I would suggest using a window function plus a small trick for the stride:
step = 10
window = 10
df = pd.DataFrame({'a': range(100)})
# start at the first complete window so each value covers rows 0-9, 10-19, ...
df['a'].rolling(window).sum()[window - 1::step]
If this is not the exact result you are looking for, check the documentation for more details regarding the bounds of the window.
You can do:
import numpy as np
step = 10
df["id"] = np.arange(len(df))//step
gr = df.groupby("id")
...
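To make the integer-division suggestion concrete, here is a minimal sketch on made-up data (25 rows in a single column `a`, both assumptions for illustration). Rows 0 to 9 get block id 0, rows 10 to 19 get block id 1, and so on; `agg` then returns one row of statistics per block in a single data frame.

```python
import numpy as np
import pandas as pd

step = 10
# Made-up data: 25 rows, so the last block is shorter than the others
df = pd.DataFrame({"a": np.arange(25, dtype=float)})

# Integer division of the row position labels each row with its block number
block_id = np.arange(len(df)) // step

# One row of statistics per block, already merged into a single data frame
stats = df.groupby(block_id)["a"].agg(["mean", "std", "count"])
print(stats)
```

If the blocks should be formed within each activity rather than over the whole frame, `df.groupby('Activity').cumcount() // step` would give a per-activity block id that can be used as a second grouping key.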

Python function appended values as list values into my dataframe. How to normalize?

I applied a function to append a list of values to my dataframe using this code:
lat=[]
for i in addresses['whole_address']:
    try:
        lat.append([locator.geocode(f'{i}').latitude])
    except:
        lat.append('na')
addresses["latitude"]=lat
The output in my data frame looks like this. The values are wrapped in [] brackets, which get in the way of downstream manipulations. How can I normalize these values so that I can calculate distances etc.?
addresses['latitude']=
[33.3064318]
11 [33.3064318]
12 [33.3064318]
15 [33.32554963636363]
When trying to calculate distances to a given location, I therefore get this error: must be real number, not list.
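I recommend using the apply function and returning the latitude itself rather than a one-element list:
def df_geocode(row):
    try:
        # return the bare float, not [float], so the column stays numeric
        lat = locator.geocode(row['whole_address']).latitude
    except Exception:
        # NaN instead of 'na' keeps the column usable in calculations
        lat = float('nan')
    return lat
addresses["Latitude"] = addresses.apply(df_geocode, axis=1)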
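If the column has already been filled with one-element lists, it can also be cleaned up after the fact without calling the geocoder again. The following is a sketch under assumptions: the data frame is rebuilt here from made-up values shaped like the output in the question (the index label 10 and the 'na' row are invented), and `unwrap` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Made-up column shaped like the one in the question: one-element lists and 'na'
addresses = pd.DataFrame(
    {"latitude": [[33.3064318], [33.3064318], "na", [33.32554963636363]]},
    index=[10, 11, 12, 15],
)


def unwrap(value):
    # Take the number out of a one-element list; anything else becomes NaN
    if isinstance(value, list) and len(value) == 1:
        return float(value[0])
    return np.nan


addresses["latitude"] = addresses["latitude"].apply(unwrap)
print(addresses["latitude"])
```

After this step the column holds plain floats, with NaN marking the addresses that could not be geocoded, so numeric functions no longer receive a list.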

Pandas: custom WMAPE function aggregation function to multiple columns without for-loop?

Objective: group pandas dataframe using a custom WMAPE (Weighted Mean Absolute Percent Error) function on multiple forecast columns and one actual data column, without for-loop. I know a for-loop & merges of output dataframes will do the trick. I want to do this efficiently.
Have: WMAPE function, successful use of WMAPE function on one forecast column of dataframe. One column of actual data, variable number of forecast columns.
Input Data: Pandas DataFrame with several categorical columns (City, Person, DT, HOUR), one actual data column (Actual), and four forecast columns (Forecast_1 ... Forecast_4). See link for csv:
https://www.dropbox.com/s/tidf9lj80a1dtd8/data_small_2.csv?dl=1
Need: WMAPE function applied during groupby on multiple columns with a list of forecast columns fed into groupby line.
Output Desired: An output dataframe with categorical groups columns and all columns of WMAPE. Labeling is preferred but not needed (output image below).
Successful Code so far:
Two WMAPE functions: one to take two series in & output a single float value (wmape), and one structured for use in a groupby (wmape_gr):
def wmape(actual, forecast):
    # take two series and calculate a wmape from them
    # make a series called mape
    se_mape = abs(actual-forecast)/actual
    # get a float of the sum of the actual
    ft_actual_sum = actual.sum()
    # get a series of the product of the actual & the mape
    se_actual_prod_mape = actual * se_mape
    # sum the product of the actual and the mape
    ft_actual_prod_mape_sum = se_actual_prod_mape.sum()
    # float: wmape of forecast
    ft_wmape_forecast = ft_actual_prod_mape_sum / ft_actual_sum
    # return a float
    return ft_wmape_forecast

def wmape_gr(df_in, st_actual, st_forecast):
    # take a dataframe plus the names of the actual and forecast columns
    # and calculate a wmape from them
    # make a series called mape
    se_mape = abs(df_in[st_actual] - df_in[st_forecast]) / df_in[st_actual]
    # get a float of the sum of the actual
    ft_actual_sum = df_in[st_actual].sum()
    # get a series of the product of the actual & the mape
    se_actual_prod_mape = df_in[st_actual] * se_mape
    # sum the product of the actual and the mape
    ft_actual_prod_mape_sum = se_actual_prod_mape.sum()
    # float: wmape of forecast
    ft_wmape_forecast = ft_actual_prod_mape_sum / ft_actual_sum
    # return a float
    return ft_wmape_forecast
# read in data directly from Dropbox
df = pd.read_csv('https://www.dropbox.com/s/tidf9lj80a1dtd8/data_small_2.csv?dl=1',sep=",",header=0)
# grouping with 3 columns. wmape_gr uses the Actual column, and Forecast_1 as inputs
df_gr = df.groupby(['City','Person','DT']).apply(wmape_gr,'Actual','Forecast_1')
Output Looks Like (first two rows):
Desired output would have all forecasts in one shot (dummy data for Forecast_2 ... Forecast_4). I can already do this with a for-loop. I just want to do it within the groupby. I want to call a wmape function four times. I would appreciate any assistance.
This is a really good problem to show how to optimize a groupby.apply in pandas. There are two principles that I use to help with these problems:
1. Any calculation that is independent of the group should not be done within a groupby.
2. If there is a built-in groupby method, use it first before using apply.
Let's go line by line through your wmape_gr function.
se_mape = abs(df_in[st_actual] - df_in[st_forecast]) / df_in[st_actual]
This line is completely independent of any group. You should do this calculation outside of the apply. Below I do this for each of the forecast columns:
df['actual_forecast_diff_1'] = (df['Actual'] - df['Forecast_1']).abs() / df['Actual']
df['actual_forecast_diff_2'] = (df['Actual'] - df['Forecast_2']).abs() / df['Actual']
df['actual_forecast_diff_3'] = (df['Actual'] - df['Forecast_3']).abs() / df['Actual']
df['actual_forecast_diff_4'] = (df['Actual'] - df['Forecast_4']).abs() / df['Actual']
Let's take a look at the next line:
ft_actual_sum = df_in[st_actual].sum()
This line is dependent on the group so we must use a groupby here, but it isn't necessary to place this within the apply function. It will be calculated later on below.
Let's move to the next line:
se_actual_prod_mape = df_in[st_actual] * se_mape
This again is independent of the group. Let's calculate it on the DataFrame as a whole.
df['forecast1_wampe'] = df['actual_forecast_diff_1'] * df['Actual']
df['forecast2_wampe'] = df['actual_forecast_diff_2'] * df['Actual']
df['forecast3_wampe'] = df['actual_forecast_diff_3'] * df['Actual']
df['forecast4_wampe'] = df['actual_forecast_diff_4'] * df['Actual']
Let's move on to the last two lines:
ft_actual_prod_mape_sum = se_actual_prod_mape.sum()
ft_wmape_forecast = ft_actual_prod_mape_sum / ft_actual_sum
These lines again are dependent on the group, but we still don't need to use apply. We now have each of the 4 'forecast_wampe' columns calculated independently of the group. We simply need to sum each one per group. The same goes for the 'Actual' column.
We can run two separate groupby operations to sum each of these columns like this:
g = df.groupby(['City', 'Person', 'DT'])
actual_sum = g['Actual'].sum()
forecast_wampe_cols = ['forecast1_wampe', 'forecast2_wampe', 'forecast3_wampe', 'forecast4_wampe']
forecast1_wampe_sum = g[forecast_wampe_cols].sum()
We get the following Series and DataFrame returned
Then we just need to divide each of the columns in the DataFrame by the Series. We'll need to use the div method to change the orientation of the division so that the indexes align
forecast1_wampe_sum.div(actual_sum, axis='index')
And this returns our answer:
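The same steps can be written without naming each forecast column, which helps when the number of forecast columns varies. This is a sketch on made-up numbers, not the data from the question. It assumes that the forecast columns all start with `Forecast_` and that `Actual` is strictly positive, in which case `Actual * |Actual - Forecast| / Actual` reduces to `|Actual - Forecast|`.

```python
import pandas as pd

# Made-up data in the same layout as the question's file
df = pd.DataFrame({
    "City": ["X", "X", "X", "Y"],
    "Person": ["p", "p", "q", "p"],
    "DT": ["d1", "d1", "d1", "d1"],
    "Actual": [10.0, 30.0, 20.0, 50.0],
    "Forecast_1": [12.0, 27.0, 20.0, 40.0],
    "Forecast_2": [10.0, 30.0, 25.0, 55.0],
})

keys = ["City", "Person", "DT"]
forecast_cols = [c for c in df.columns if c.startswith("Forecast_")]

# Group-independent step: absolute error of every forecast column at once
abs_err = df[forecast_cols].sub(df["Actual"], axis=0).abs()

# Group-dependent step: built-in sums, then divide by the summed actuals
err_sum = abs_err.groupby([df[k] for k in keys]).sum()
actual_sum = df.groupby(keys)["Actual"].sum()
wmape = err_sum.div(actual_sum, axis="index").add_suffix("_wmape").reset_index()
print(wmape)
```

Only built-in groupby sums are used here, so the cost of the per-group work stays the same no matter how many forecast columns there are.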
If you modify wmape to work with arrays using broadcasting, then you can do it in one shot:
def wmape(actual, forecast):
    # Take a series (actual) and a dataframe (forecast) and calculate wmape
    # for each forecast. Output shape is (1, num_forecasts)
    # Convert to numpy arrays for broadcasting
    forecast = np.array(forecast.values)
    actual=np.array(actual.values).reshape((-1, 1))
    # Make an array of mape (same shape as forecast)
    se_mape = abs(actual-forecast)/actual
    # Calculate sum of actual values
    ft_actual_sum = actual.sum(axis=0)
    # Multiply the actual values by the mape
    se_actual_prod_mape = actual * se_mape
    # Take the sum of the product of actual values and mape
    # Make sure to sum down the rows (1 for each column)
    ft_actual_prod_mape_sum = se_actual_prod_mape.sum(axis=0)
    # Calculate the wmape for each forecast and return as a dictionary
    ft_wmape_forecast = ft_actual_prod_mape_sum / ft_actual_sum
    return {f'Forecast_{i+1}_wmape': wmape for i, wmape in enumerate(ft_wmape_forecast)}
Then use apply on the proper columns:
# Group the dataframe and apply the function to appropriate columns
new_df = (
    df.groupby(['City', 'Person', 'DT'])
    .apply(lambda x: wmape(x['Actual'], x[[c for c in x if 'Forecast' in c]]))
    .to_frame()
    .reset_index()
)
This results in a dataframe with a single dictionary column.
The single column can be converted to multiple columns for the correct format:
# Convert the dictionary in a single column into 4 columns with proper names
# and concatenate column-wise
df_grp = pd.concat([new_df.drop(columns=[0]),
                    pd.DataFrame(list(new_df[0].values))], axis=1)
Result:
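A possible variation on the dictionary approach, sketched under assumptions: when the function passed to `apply` returns a `pd.Series` with the same labels for every group, pandas arranges those labels as columns, so the separate dictionary-to-columns step is not needed. The helper name `wmape_all` and the data below are made up for illustration.

```python
import pandas as pd


def wmape_all(group, actual_col, forecast_cols):
    # Hypothetical helper: one WMAPE per forecast column, returned as a Series
    actual = group[actual_col]
    # mape per row and column, then weighted by the actual value
    se_mape = group[forecast_cols].sub(actual, axis=0).abs().div(actual, axis=0)
    weighted = se_mape.mul(actual, axis=0)
    return (weighted.sum() / actual.sum()).add_suffix("_wmape")


# Made-up data in the same layout as the question's file
df = pd.DataFrame({
    "City": ["X", "X", "X", "Y"],
    "Person": ["p", "p", "q", "p"],
    "DT": ["d1", "d1", "d1", "d1"],
    "Actual": [10.0, 30.0, 20.0, 50.0],
    "Forecast_1": [12.0, 27.0, 20.0, 40.0],
    "Forecast_2": [10.0, 30.0, 25.0, 55.0],
})
keys = ["City", "Person", "DT"]
forecast_cols = ["Forecast_1", "Forecast_2"]

# Selecting the value columns first keeps the grouping keys out of each group
result = (
    df.groupby(keys)[["Actual"] + forecast_cols]
    .apply(wmape_all, "Actual", forecast_cols)
    .reset_index()
)
print(result)
```

This still calls a Python function once per group, so it trades some speed for keeping the calculation in one place.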
Without changing the functions, you can apply wmape_gr four times:
df_gr1 = df.groupby(['City','Person','DT']).apply(wmape_gr,'Actual','Forecast_1')
df_gr2 = df.groupby(['City','Person','DT']).apply(wmape_gr,'Actual','Forecast_2')
df_gr3 = df.groupby(['City','Person','DT']).apply(wmape_gr,'Actual','Forecast_3')
df_gr4 = df.groupby(['City','Person','DT']).apply(wmape_gr,'Actual','Forecast_4')
Join them together:
all1= pd.concat([df_gr1, df_gr2,df_gr3,df_gr4],axis=1, sort=False)
Get the columns for City, Person and DT:
all1['city']= [all1.index[i][0] for i in range(len(df_gr1))]
all1['Person']= [all1.index[i][1] for i in range(len(df_gr1))]
all1['DT']= [all1.index[i][2] for i in range(len(df_gr1))]
Rename the columns and change the order:
df = all1.rename(columns={0:'Forecast_1_wmape', 1:'Forecast_2_wmape',2:'Forecast_3_wmape',3:'Forecast_4_wmape'})
df = df[['city','Person','DT','Forecast_1_wmape','Forecast_2_wmape','Forecast_3_wmape','Forecast_4_wmape']]
df=df.reset_index(drop=True)

Iterating over entire columns and storing the result into a list

I would like to know how I could iterate through each column of a dataframe to perform some calculations and store the results in another dataframe.
df_empty = []
m = daily.ix[:,-1] #Columns= stocks & Rows= daily returns
stocks = daily.ix[:,:-1]
for col in range (len(stocks.columns)):
    s = daily.ix[:,col]
    covmat = np.cov(s,m)
    beta = covmat[0,1]/covmat[1,1]
    return (beta)
    print(beta)
In the above example, I first want to calculate a covariance matrix between "s" (the columns representing stocks daily returns and for which I want to iterate through one by one) and "m" (the market daily return which is my reference column/the last column of my dataframe). Then I want to calculate the beta for each covariance pair stock/market.
I'm not sure why return(beta) gives me a single numerical result for one stock while print(beta) prints the beta for all stocks.
I'd like to find a way to create a dataframe with all these betas.
beta_df = df_empty.append(beta)
I have tried the above code but it returns 'none' as if it could not append the outcome.
Thank you for your help
The return statement within your for-loop ends the loop the first time it is encountered. Moreover, you are not saving the beta value anywhere, because a for-loop in Python does not itself return a value (it only has side effects).
Apart from that, you may choose a more pandas-like approach using apply on the data frame which basically iterates over the columns of the data frame and passes each column to a supplied function as the first parameter while returning the result of the function call. Here is a minimal working example with some dummy data:
import pandas as pd
import numpy as np
# create some dummy data
daily = pd.DataFrame(np.random.randint(100, size=(100, 5)))
# define reference column
cov_column = daily.iloc[:, -1]
# setup computation function
def compute(column):
    covmat = np.cov(column, cov_column)
    return covmat[0,1]/covmat[1,1]
# use apply to iterate over columns
result = daily.iloc[:, :-1].apply(compute)
# show output
print(result)
0 -0.125382
1 0.024777
2 0.011324
3 -0.017622
dtype: float64
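For completeness, the explicit loop from the question can also be made to work. `list.append` changes the list in place and returns `None`, which is why assigning its result to `beta_df` gave `None`. The sketch below collects the betas in a list and builds the data frame afterwards; the column names and the random data are assumptions for illustration, and `.iloc` is used because `.ix` is no longer available in current pandas.

```python
import numpy as np
import pandas as pd

# Made-up daily returns: three stocks plus the market in the last column
rng = np.random.default_rng(0)
daily = pd.DataFrame(rng.normal(size=(100, 4)), columns=["s1", "s2", "s3", "market"])

market = daily.iloc[:, -1]
stocks = daily.iloc[:, :-1]

betas = []
for name in stocks.columns:
    covmat = np.cov(stocks[name], market)
    # append mutates the list and returns None, so do not assign its result
    betas.append(covmat[0, 1] / covmat[1, 1])

# Build the data frame once, after the loop has finished
beta_df = pd.DataFrame({"stock": stocks.columns, "beta": betas})
print(beta_df)
```

The values agree with `daily.cov()` divided by the market variance, since both `np.cov` and pandas use the sample (n - 1) normalisation by default.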
