Creating a bootstrap sample by group in Python

I have a dataframe looking something like this:

         y  X1  X2  X3
ID year
1  2010  1   2   3   4
1  2011  3   4   5   6
2  2010  1   2   3   4
2  2011  3   4   5   6
2  2012  7   8   9  10
...
I'd like to create several bootstrap samples from the original df, calculate a fixed-effects panel regression on each bootstrap sample, and then store the corresponding beta coefficients. The approach I found for a "normal" linear regression is the following:
import pandas as pd
import statsmodels.api as sm2
from linearmodels.panel import PanelOLS

betas = pd.DataFrame()
for i in range(10):
    # Create a bootstrap sample with replacement (row-wise)
    bootstrap = df.sample(n=df.shape[0], replace=True)
    # Fit the regression and save the beta coefficients
    DV_bs = bootstrap.y
    IV_bs = sm2.add_constant(bootstrap[['X1', 'X2', 'X3']])
    fe_mod_bs = PanelOLS(DV_bs, IV_bs, entity_effects=True).fit(cov_type='clustered', cluster_entity=True)
    b = pd.DataFrame(fe_mod_bs.params)
    print(b.head())
    betas = pd.concat([betas, b], axis=1, join='outer')
Unfortunately the bootstrap samples need to be selected by group for the panel regression, so that a complete ID is picked instead of just one row. I could not figure out how to extend the function to create a sample that way. So I basically have two questions:
Does the overall approach make sense for panel regression at all?
How do I adjust the bootstrapping so that the multilevel / panel structure is taken into account and complete IDs instead of single rows are "picked" during the bootstrapping?

I solved my problem with the following code:
import numpy as np
import pandas as pd
import statsmodels.api as sm2
from linearmodels.panel import PanelOLS
from tqdm import tqdm

companies = pd.DataFrame(df.reset_index().Company.unique())
betas_summary = pd.DataFrame()
for i in tqdm(range(1, 10001)):
    # Create a bootstrap sample of entities (companies) with replacement
    bootstrap = companies.sample(n=companies.shape[0], replace=True)
    bootstrap.rename(columns={bootstrap.columns[0]: "Company"}, inplace=True)
    # Rebuild a full (Company, Period) MultiIndex for the sampled companies
    Period = list(range(1, 25))
    list_of_bs_comp = bootstrap.Company.to_list()
    multiindex = [list_of_bs_comp, np.array(Period)]
    bs_df = pd.MultiIndex.from_product(multiindex, names=['Company', 'Period'])
    bs_result = df.loc[bs_df, :]
    # Fit the regression and save the beta coefficients
    DV_bs = bs_result.y
    IV_bs = sm2.add_constant(bs_result[['X1', 'X2', 'X3']])
    fe_mod_bs = PanelOLS(DV_bs, IV_bs, entity_effects=True).fit(cov_type='clustered', cluster_entity=True)
    b = pd.DataFrame(fe_mod_bs.params)
    b.rename(columns={'parameter': "b"}, inplace=True)
    # Accumulate the coefficients across bootstrap iterations
    betas_summary = pd.concat([betas_summary, b], axis=1, join='outer')
where Company is my entity variable and Period is my time variable.
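Note that MultiIndex.from_product assumes a balanced panel (every company observed for all 24 periods). For an unbalanced panel, a minimal sketch of the same cluster bootstrap built from per-company blocks, assuming df is indexed by ['Company', 'Period'] as above, might look like this:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
ids = df.index.get_level_values('Company').unique()

# Draw company IDs with replacement, then stack each company's full block of rows
sampled_ids = rng.choice(ids, size=len(ids), replace=True)
pieces = []
for draw, cid in enumerate(sampled_ids):
    block = df.xs(cid, level='Company', drop_level=False).reset_index()
    block['Company'] = draw  # relabel so repeated companies remain distinct entities/clusters
    pieces.append(block)
bs_result = pd.concat(pieces).set_index(['Company', 'Period'])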

Related

Applying Featuretools output to another dataframe

I have a dataframe with the target features, which looks like this:
x  x1  y
1   2  3
2   3  4
Now I use featuretools to automatically do the feature engineering, using this code:
import featuretools as ft

es = ft.EntitySet(id='x')
es.entity_from_dataframe(entity_id='y', dataframe=df, index='x')
feature_matrix, feature_names = ft.dfs(entityset=es,
                                       target_entity='y',
                                       max_depth=2,
                                       verbose=1,
                                       n_jobs=3)
I would like to take the features generated and then apply them to a dataset which lacks the labels, something that looks like this:
x  x1
1   2
How would I take the features generated (e.g. mean of x + x1) and map their creation process ((df['x'] + df['x1']).mean()) onto the dataframe lacking the label?
This answered my question, specifically the part about saving features:
https://featuretools.alteryx.com/en/stable/guides/deployment.html
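For reference, a minimal sketch of that save-and-reapply workflow, assuming the same entity_from_dataframe API used above and a hypothetical unlabelled dataframe new_df (exact calls may differ across featuretools versions):

import featuretools as ft

# Save the feature definitions produced by ft.dfs on the labelled data
ft.save_features(feature_names, "feature_definitions.json")

# Later: build an EntitySet around the unlabelled dataframe (new_df) and
# recompute the same features on it
feature_defs = ft.load_features("feature_definitions.json")
es_new = ft.EntitySet(id='x')
es_new.entity_from_dataframe(entity_id='y', dataframe=new_df, index='x')
feature_matrix_new = ft.calculate_feature_matrix(features=feature_defs, entityset=es_new)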

How do I run multiple linear models for unique IDs and put the results in a single dataframe by the unique IDs?

How do I get the regression intercept and coefficient data for unique IDs in a dataframe into a single dataframe where each row has the UID, its intercept, and its coefficients?
This is a snippet of what my raw data looks like. Future data can have more UIDs and more fields (independent variables).
UID  A1           A2           A3           A4           Rating
1    0.377489423  0.950311846  0.892135293  0.077054085  4
1    0.595570737  0.824334482  0.388634543  0.947936483  4
1    0.585703124  0.825486315  0.569809886  0.321117521  3
1    0.386968371  0.594556911  0.260187376  0.394238102  4
1    0.532731866  0.219741858  0.865710517  0.173044631  3
1    0.16565561   0.125096015  0.881841651  0.494690133  4
2    0.42418965   0.814894214  0.989426645  0.871014023  1
2    0.742604257  0.571780036  0.247811255  0.468820653  2
2    0.401989919  0.375134173  0.539599593  0.443260146  3
2    0.167910365  0.940073739  0.490081723  0.803074574  5
2    0.614160221  0.045817359  0.077645469  0.367456074  4
3    0.866397055  0.2932472    0.968410252  0.348542304  5
3    0.141680391  0.998446121  0.201506356  0.689863785  1
3    0.407182414  0.721650663  0.174277013  0.922810374  1
Here is the code I wrote to loop through each unique UID and run the linear model and add the intercept and coefficients for each UID to a list.
from sklearn.linear_model import LinearRegression

ids = df.UID.unique()
op = []
for i in ids:
    df_i = df[df.UID == i]
    X = df_i.drop(['UID', 'Rating'], axis=1)
    y = df_i['Rating']
    reg = LinearRegression().fit(X, y)
    reg.score(X, y)
    const = reg.intercept_
    coef = reg.coef_
    op.append(const)
    op.append(coef)
op
I would like my output to be in this format (the data shown is dummy data), so that each row has the UID, its intercept, and the linear regression coefficients. This is where I am stuck.
UID  Intercept  A1           A2           A3           A4
1    3.2343     0.950311846  0.892135293  0.077054085  4.3454
2    2.123      0.824334482  0.388634543  0.947936483  2.3454
3    3.455      0.825486315  0.569809886  0.321117521  3.12343
Feel free to comment on the initial approach to get the regression models as well.
Thank you
Here is what I came up with. I just need to add the UID; I'm not sure how to add that for each row.
ids = df.UID.unique()
op = pd.DataFrame()
intercept = []
coefficients = []
UID = []
for i in ids:
    df_i = df[df.UID == i]
    X = df_i.drop(['UID', 'Rating'], axis=1)
    y = df_i['Rating']
    reg = LinearRegression().fit(X, y)
    reg.score(X, y)
    unique_id = df_i['UID'].unique()
    const = reg.intercept_
    coef = reg.coef_
    UID.append(unique_id)
    intercept.append(const)
    coefficients.append(coef)
intercep_new = pd.DataFrame(intercept)
coefficients_new = pd.DataFrame(coefficients)
UID_new = pd.DataFrame(UID)
colNames = df.drop(['Rating'], axis=1).columns
colNames = colNames.insert(1, 'Const')
colNames
op = pd.concat([UID_new, intercep_new, coefficients_new], axis=1)
op.columns = colNames
See changes below:
ids = df.UID.unique()
op = pd.DataFrame()
for i in ids:
    df_i = df[df.UID == i]
    X = df_i.drop(['UID', 'Rating'], axis=1)
    y = df_i['Rating']
    reg = LinearRegression().fit(X, y)
    reg.score(X, y)
    const = reg.intercept_
    coef = reg.coef_
    uid = i
    # Build one row per UID: coefficients, then intercept, then the UID itself
    array = np.append(coef, const)
    array = np.append(array, uid)
    array = array.reshape(1, len(array))
    df_append = pd.DataFrame(array)
    op = pd.concat([op, df_append])  # DataFrame.append was removed in pandas 2.0
op.columns = ['A' + str(i) for i in range(1, len(op.columns) + 1)]
op.rename(columns={op.columns[-1]: "UID"}, inplace=True)
op.rename(columns={op.columns[-2]: "Intercept"}, inplace=True)
op = op.reset_index().drop('index', axis=1)
op = op.drop_duplicates()

How to perform time series analysis that contains multiple groups in Python using fbProphet or other models?

All,
My dataset looks like the following. I am trying to predict the 'amount' for the next 6 months using either fbProphet or another model, but my issue is that I would like to predict the amount for each group, i.e. A, B, C, D, for the next 6 months. I am not sure how to do that in Python using fbProphet or another model. I referenced the official page of fbprophet, but the only information I found is that "Prophet" takes only two columns, one "Date" and the other "amount".
I am new to python, so any help with code explanation is greatly appreciated!
import pandas as pd

data = {'Date': ['2017-01-01', '2017-02-01', '2017-03-01', '2017-04-01', '2017-05-01', '2017-06-01', '2017-07-01'],
        'Group': ['A', 'B', 'C', 'D', 'C', 'A', 'B'],
        'Amount': ['12.1', '13', '15', '10', '12', '9.0', '5.6']}
df = pd.DataFrame(data)
print(df)
output:
         Date Group Amount
0  2017-01-01     A   12.1
1  2017-02-01     B     13
2  2017-03-01     C     15
3  2017-04-01     D     10
4  2017-05-01     C     12
5  2017-06-01     A    9.0
6  2017-07-01     B    5.6
fbprophet requires two columns, ds and y, so you first need to rename the two columns:
df = df.rename(columns={'Date': 'ds', 'Amount':'y'})
Assuming that your groups are independent of each other and you want one prediction per group, you can group the dataframe by the "Group" column and run a forecast for each group:
from fbprophet import Prophet

grouped = df.groupby('Group')
for g in grouped.groups:
    group = grouped.get_group(g)
    m = Prophet()
    m.fit(group)
    future = m.make_future_dataframe(periods=365)
    forecast = m.predict(future)
    print(forecast.tail())
Take note that the input dataframe you supply in the question is not sufficient for the model, because group D has only a single data point; fbprophet's forecast needs at least 2 non-NaN rows per group.
EDIT: if you want to merge all predictions into one dataframe, the idea is to name the yhat column for each group differently, do pd.merge() in the loop, and then cherry-pick the columns that you need at the end:
final = pd.DataFrame()
for g in grouped.groups:
    group = grouped.get_group(g)
    m = Prophet()
    m.fit(group)
    future = m.make_future_dataframe(periods=365)
    forecast = m.predict(future)
    forecast = forecast.rename(columns={'yhat': 'yhat_' + g})
    final = pd.merge(final, forecast.set_index('ds'), how='outer', left_index=True, right_index=True)
final = final[['yhat_' + g for g in grouped.groups.keys()]]
import pandas as pd
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tsa.arima_model import ARIMA
from statsmodels.tsa.stattools import adfuller
from matplotlib import pyplot as plt
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_squared_log_error

# Before doing any modelling with ARIMA, SARIMA etc., confirm that
# your time series is stationary, e.g. with the Augmented Dickey-Fuller test
# or other tests.

# Create a list of all groups, or get them from the data using np.unique or other methods
groups_iter = ['A', 'B', 'C', 'D']
dict_org = {}
dict_pred = {}
group_accuracy = {}

# Iterate over all groups and get the data
# from the dataframe by filtering for the specific group
for i in range(len(groups_iter)):
    X = df[df['Group'] == groups_iter[i]]['Amount'].values
    size = int(len(X) * 0.70)
    train, test = X[0:size], X[size:len(X)]
    history = [x for x in train]
    predictions = []
    # Using an ARIMA model here; you can also grid-search for the best parameters
    for t in range(len(test)):
        model = ARIMA(history, order=(5, 1, 0))
        model_fit = model.fit(disp=0)
        output = model_fit.forecast()
        yhat = output[0]
        predictions.append(yhat)
        obs = test[t]
        history.append(obs)
        print("Predicted: %f, expected: %f" % (yhat, obs))
    error = mean_squared_log_error(test, predictions)
    dict_org.update({groups_iter[i]: test})
    dict_pred.update({groups_iter[i]: predictions})
    print("Group: ", groups_iter[i], "Test MSE: %f" % error)
    group_accuracy.update({groups_iter[i]: error})
    plt.plot(test)
    plt.plot(predictions, color='red')
    plt.show()
I know this is old, but I was trying to predict outcomes for different clients and tried to use Aditya Santoso's solution above. I ran into some errors, so I added a couple of modifications and finally this worked for me:
import pandas as pd
from fbprophet import Prophet

df = pd.read_csv('file.csv')
df = df.rename(columns={'date': 'ds', 'amount': 'y', 'client_id': 'client_id'})
# I had to filter out clients with fewer than 3 records first to avoid errors,
# as prophet only works with 2+ records per group
df = df.groupby('client_id').filter(lambda x: len(x) > 2)
df.client_id = df.client_id.astype(str)
final = pd.DataFrame(columns=['client', 'ds', 'yhat'])
grouped = df.groupby('client_id')
for g in grouped.groups:
    group = grouped.get_group(g)
    m = Prophet()
    m.fit(group)
    future = m.make_future_dataframe(periods=365)
    forecast = m.predict(future)
    # I added a column with the client id
    forecast['client'] = g
    # I used concat instead of merge
    final = pd.concat([final, forecast], ignore_index=True)
final.head(10)

How to do Naive Bayes modelling (using sklearn MultinomialNB) in python

I am currently learning how to do Naive Bayes modelling and attempting to apply it in Python and R; however, using a toy example, I am struggling to recreate in Python the same numbers that I get from doing the calculations in R or by hand.
Help in figuring out why I am getting different numbers would be appreciated!
The toy data is
Class (y) A A A A B B B B B B
var x1 2 1 1 0 0 1 1 0 0 0
var x2 0 0 1 0 0 1 1 1 1 1
That is to say, my dependent variable y has 2 levels, A and B; explanatory variable x1 has 3 levels (0, 1, 2) and x2 has two levels (0 and 1).
My current objective is to predict, using a multinomial naive Bayes model, the class probabilities of a new data point with values x1=1 and x2=1.
My current python code is:
import pandas as pd
from sklearn.naive_bayes import MultinomialNB

dat = pd.DataFrame({
    "class": ["A", "A", "A", "A", "B", "B", "B", "B", "B", "B"],
    "x1": [2, 1, 1, 0, 0, 1, 1, 0, 0, 0],
    "x2": [0, 0, 1, 0, 1, 0, 1, 1, 1, 1]
})

mnb = MultinomialNB(alpha=0)
x = mnb.fit(dat[["x1", "x2"]], dat["class"])
x.predict_proba(pd.DataFrame([[1, 1]], columns=["x1", "x2"]))
## Out[160]: array([[ 0.34325744,  0.65674256]])
However attempting the same in R I get:
library(dplyr)
library(e1071)

dat = data_frame(
  "class" = c("A", "A", "A", "A", "B", "B", "B", "B", "B", "B"),
  "x1" = c(2, 1, 1, 0, 0, 1, 1, 0, 0, 0),
  "x2" = c(0, 0, 1, 0, 1, 0, 1, 1, 1, 1)
)

model <- naiveBayes(class ~ . , data = table(dat))

predict(
  model,
  newdata = data_frame(
    x1 = factor(1, levels = c(0, 1, 2)),
    x2 = factor(1, levels = c(0, 1))),
  type = "raw"
)
##              A         B
## [1,] 0.2307692 0.7692308
And by hand I get the following:
The model is
From the data we get the following probability estimates
Thus plugging the numbers in we get
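A plausible reconstruction of that calculation, assuming each feature is treated as categorical within each class (which is what e1071's naiveBayes does with the tabulated data):

P(A) = 4/10, P(B) = 6/10
P(x1=1 | A) = 2/4, P(x2=1 | A) = 1/4
P(x1=1 | B) = 2/6, P(x2=1 | B) = 5/6

P(A | x1=1, x2=1) proportional to 0.4 * (2/4) * (1/4) = 0.05
P(B | x1=1, x2=1) proportional to 0.6 * (2/6) * (5/6) = 0.1667

Normalising gives 0.05 / (0.05 + 0.1667) = 0.2308 for A and 0.7692 for B.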
Which matches the results from R. So again, I'm confused as to what I am doing wrong in the Python example. Any help would be appreciated.

How do I modify this function to accept multiple Dataframes?

I wrote this function and I would like it to accept more than one DF, so that the final plot has multiple plotted lines for the predictions and the coef_DF gets completed with the rest of the coefficients.
The function extracts the needed features and target from a much larger dataset to make predictions using a linear regression, then builds the model, plots the fitted line over the dataset, and returns a df with all the coefficients.
(This is just an exercise.)
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.linear_model import LinearRegression

def prep_model_and_predict(feature, target, dataset, degree):
    # part 1: make a df with the relevant format and features
    # degree >= 1
    poly_df = pd.DataFrame()
    poly_df[str(target)] = dataset[str(target)]
    poly_df['power_1'] = dataset[str(feature)]
    # check if degree > 1
    if degree > 1:
        for power in range(2, degree + 1):  # loop over the remaining degrees
            name = 'power_' + str(power)
            poly_df[name] = poly_df['power_1'].apply(lambda x: x**power)
    # part 2: make the model and predictions
    features = list(poly_df.columns[1:])
    X = poly_df[features]
    y = poly_df[str(target)]
    model = LinearRegression().fit(X, y)
    predictions = model.predict(X)
    # part 3: put the weights in a nice df
    coef_df = pd.DataFrame()
    coef_df = coef_df.append({"Name": 'Intercept', 'Value': model.intercept_}, ignore_index=True)
    coef_df = coef_df.append({'Name': 'Power_1', 'Value': model.coef_[0]}, ignore_index=True)
    if degree > 1:
        for degree in range(2, degree + 1):
            name = 'Power_' + str(degree)
            coef_df = coef_df.append({"Name": name,
                                      'Value': '{:.3e}'.format(model.coef_[degree - 1])}, ignore_index=True)
    # part 4: plot it
    fig, ax = plt.subplots()
    ax.plot(poly_df['power_1'], poly_df[str(target)], '.',
            poly_df['power_1'], predictions, '-')
    ax.set_xlabel('Square footage, living area')
    ax.set_ylabel('Price per Sqft')
    ax.ticklabel_format(axis='y', style='sci', scilimits=(-2, 2))
    return coef_df, ax
and this is the result:
Name Value
0 Intercept 506738
1 Power_1 2.71336e-77
2 Power_2 7.335e-39
3 Power_3 -1.850e-44
4 Power_4 8.437e-50
5 Power_5 0.000e+00
6 Power_6 0.000e+00
7 Power_7 3.645e-55
8 Power_8 1.504e-51
9 Power_9 5.760e-48
10 Power_10 1.958e-44
11 Power_11 5.394e-41
12 Power_12 9.404e-38
13 Power_13 -3.635e-41
14 Power_14 4.655e-45
15 Power_15 -1.972e-49
much appreciated!
I am not sure exactly what you are asking for, but I would suggest that next time you try to ask a question that is easily reproducible and runnable by other people here on SO.
I have tried to answer your questions. Correct me if I have misunderstood your question.
Pass an arbitrary number of DataFrames to your function and plot them:
I have created three random dataframes to use:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

df1 = pd.DataFrame(np.random.randint(0, 10, size=(10, 2)), columns=list('AB'))
df2 = pd.DataFrame(np.random.randint(0, 10, size=(10, 2)), columns=list('AB'))
df3 = pd.DataFrame(np.random.randint(0, 10, size=(10, 2)), columns=list('AB'))
The function that plots them:
def plot_me(*args):
    plt.figure(figsize=(13, 9))
    lab_ind = 0
    for i in args:
        plt.plot(i['A'], i['B'], label=lab_ind)
        lab_ind += 1
    plt.legend()
    plt.show()
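For example, calling it on the three frames created above:
plot_me(df1, df2, df3)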
The result plot you get:
Put the results of your model into a DataFrame
Regarding your second question, I am not going to concentrate too much on your exact details - for example, the names of the columns of your dataframe, etc.
For this particular example I have generated two random arrays:
X = np.random.randint(0,50 ,size=(50, 2))
y = np.random.randint(0,2 ,size=(50, 1))
Then fit a LinearRegression model on this data.
model=LinearRegression().fit(X,y)
predictions=model.predict(X)
And then add it to a DataFrame:
res_df = pd.DataFrame(predictions,columns = ['Value'])
And if you print res_df
Value
0 0.420395
1 0.459389
2 0.369648
3 0.416058
4 0.644088
5 0.362072
6 0.363157
7 0.468943
. .
. .
