I am currently learning Naive Bayes modelling and attempting to apply it in Python and R. However, using a toy example, I am struggling to recreate in Python the same numbers that I get from doing the calculations either in R or by hand.
Help in figuring out why I am getting different numbers would be appreciated!
The toy data is
Class (y) A A A A B B B B B B
var x1 2 1 1 0 0 1 1 0 0 0
var x2 0 0 1 0 0 1 1 1 1 1
That is to say, my dependent variable y has 2 levels (A and B), explanatory variable x1 has 3 levels (0, 1, 2) and x2 has 2 levels (0 and 1).
My current objective is to predict, using a multinomial naive Bayes model, the class probabilities of a new data point with values x1=1 and x2=1.
My current python code is:
import pandas as pd
from sklearn.naive_bayes import MultinomialNB
dat = pd.DataFrame({
"class" : ["A", "A","A","A", "B","B","B","B","B","B"],
"x1" : [2,1,1,0,0,1,1,0,0,0],
"x2" : [0,0,1,0,1,0,1,1,1,1]
})
mnb = MultinomialNB(alpha=0)
x = mnb.fit(dat[["x1","x2"]], dat["class"])
x.predict_proba( pd.DataFrame( [[1,1]] , columns=["x1","x2"]) )
## Out[160]: array([[ 0.34325744, 0.65674256]])
However, attempting the same in R, I get:
library(dplyr)
library(e1071)
dat = data_frame(
"class" = c("A", "A","A","A", "B","B","B","B","B","B"),
"x1" = c(2,1,1,0,0,1,1,0,0,0),
"x2" = c(0,0,1,0,1,0,1,1,1,1)
)
model <- naiveBayes(class ~ . , data = table(dat) )
predict(
model,
newdata = data_frame(
x1 = factor(1, levels = c(0,1,2)) ,
x2 = factor(1, levels = c(0,1))),
type = "raw"
)
## A B
## [1,] 0.2307692 0.7692308
And by hand I get the following:
The model is
P(y | x1, x2) ∝ P(y) * P(x1 | y) * P(x2 | y)
From the data we get the following probability estimates
P(A) = 4/10, P(x1=1 | A) = 2/4, P(x2=1 | A) = 1/4
P(B) = 6/10, P(x1=1 | B) = 2/6, P(x2=1 | B) = 5/6
Thus plugging the numbers in we get
P(A | x1=1, x2=1) ∝ 0.4 * 0.5 * 0.25 = 0.05
P(B | x1=1, x2=1) ∝ 0.6 * (2/6) * (5/6) ≈ 0.1667
and after normalising, P(A | x1=1, x2=1) = 0.05 / (0.05 + 0.1667) ≈ 0.2308 and P(B | x1=1, x2=1) ≈ 0.7692.
This matches the results from R. So again, I'm confused as to what I am doing wrong in the Python example. Any help would be appreciated.
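For what it's worth, my current guess (not something I have confirmed) is that MultinomialNB interprets the feature values as counts, whereas the R model and the hand calculation treat x1 and x2 as categorical levels. A minimal sketch of the categorical treatment in scikit-learn, assuming CategoricalNB (available in newer scikit-learn versions) is acceptable here:
from sklearn.naive_bayes import CategoricalNB
# no smoothing, to mirror the hand calculation above
cnb = CategoricalNB(alpha=0)
cnb.fit(dat[["x1", "x2"]], dat["class"])
print(cnb.predict_proba(pd.DataFrame([[1, 1]], columns=["x1", "x2"])))
## if the guess is right, this should give roughly [0.2308, 0.7692], matching R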
I am trying to get RF feature importance. I fit the random forest on the data like this:
model = RandomForestRegressor()
n = model.fit(self.X_train,self.y_train)
if n is not None:
df = pd.DataFrame(data = n , columns = ["Feature","Importance_Score"])
df["Feature_Name"] = np.array(self.X_Headers)
df = df.drop(["Feature"], axis = 1)
df[["Feature_Name","Importance_Score"]].to_csv("RF_Importances.csv", index = False)
del df
However, the n variable is None. Why is this happening?
I'm not very sure how your model.fit(self.X_train, self.y_train) is supposed to work; more information about how you set up the model is needed.
If we set this up using simulated data, it works:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

np.random.seed(111)
X = pd.DataFrame(np.random.normal(0,1,(100,5)),columns=['A','B','C','D','E'])
y = np.random.normal(0,1,100)
model = RandomForestRegressor()
n = model.fit(X,y)
if n is not None:
    df = pd.DataFrame({'features': X.columns, 'importance': n.feature_importances_})

df
features importance
0 A 0.176091
1 B 0.183817
2 C 0.169927
3 D 0.267574
4 E 0.202591
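From there, writing the importances out the way the original code intended is a one-liner (a sketch; the file name is simply the one from the question):
# sort by importance and save, mirroring the CSV output the question was aiming for
df.sort_values('importance', ascending=False).to_csv("RF_Importances.csv", index=False)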
I have 1000 samples of class 1 and 2500 of class 2. So naturally, when using
sklearn's train_test_split(test_size=200, stratify=y), I get an imbalanced test set, since it preserves the class distribution of the original data set. However, I would like the split to give me 100 class-1 and 100 class-2 samples in the test set.
How would I do it? Any suggestions would be appreciated.
Split Manually
A manual solution isn't that scary. Main steps explained:
Isolate the index of class-1 and class-2 rows.
Use np.random.permutation() to select random n1 and n2 test samples for class 1 and 2 respectively.
Use df.index.difference() to perform inverse selection for the train samples.
The code can easily be generalized to an arbitrary number of classes and arbitrary test-set sizes per class (just put n1/n2, idx1/idx2, etc. into lists and process them in a loop); a sketch of that generalization is given after the main code below.
Code
import numpy as np
import pandas as pd
# data
df = pd.DataFrame(
data={
"label": np.array([1]*1000 + [2]*2500),
# label 1 has value > 0, label 2 has value < 0
"value": np.hstack([np.random.uniform(0, 1, 1000),
np.random.uniform(-1, 0, 2500)])
}
)
df = df.sample(frac=1).reset_index(drop=True)
# sampling number for each class
n1 = 100
n2 = 100
# 1. get indexes and lengths for the classes respectively
idx1 = df.index.values[df["label"] == 1]
idx2 = df.index.values[df["label"] == 2]
len1 = len(idx1) # 1000
len2 = len(idx2) # 2500
# 2. draw index for test dataset
draw1 = np.random.permutation(len1)[:n1] # keep the first n1 entries to be selected
idx1_test = idx1[draw1]
draw2 = np.random.permutation(len2)[:n2]
idx2_test = idx2[draw2]
# combine the drawn indexes
idx_test = np.hstack([idx1_test, idx2_test])
# 3. derive index for train dataset
idx_train = df.index.difference(idx_test)
# split
df_train = df.loc[idx_train, :] # optional: .reset_index(drop=True)
df_test = df.loc[idx_test, :]
# len(df_train) = 3300
# len(df_test) = 200
# verify that no row was missing
idx_merged = np.hstack([df_train.index.values, df_test.index.values])
assert len(np.unique(idx_merged)) == 3500
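As mentioned above, the same idea generalizes to any number of classes; a sketch, reusing df from above with the per-class test counts given as a dict:
# generalized version: per-class test counts as {label: n_test}
test_counts = {1: 100, 2: 100}
idx_test_parts = []
for label, n_test in test_counts.items():
    idx_class = df.index.values[df["label"] == label]
    draw = np.random.permutation(len(idx_class))[:n_test]
    idx_test_parts.append(idx_class[draw])
idx_test = np.hstack(idx_test_parts)
idx_train = df.index.difference(idx_test)
df_train, df_test = df.loc[idx_train, :], df.loc[idx_test, :]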
I have a dataframe looking something like that:
y X1 X2 X3
ID year
1 2010 1 2 3 4
1 2011 3 4 5 6
2 2010 1 2 3 4
2 2011 3 4 5 6
2 2012 7 8 9 10
...
I'd like to create several bootstrap samples from the original df, fit a fixed-effects panel regression on each bootstrap sample, and then store the corresponding beta coefficients. The approach I found for "normal" linear regression is the following:
import pandas as pd
import statsmodels.api as sm2            # assumption: sm2 is the statsmodels.api alias used below
from linearmodels.panel import PanelOLS

betas = pd.DataFrame()
for i in range(10):
# Creating a bootstrap sample with replacement
bootstrap = df.sample(n=df.shape[0], replace=True)
# Fit the regression and save beta coefficients
DV_bs = bootstrap.y
IV_bs = sm2.add_constant(bootstrap[['X1', 'X2', 'X3']])
fe_mod_bs = PanelOLS(DV_bs, IV_bs, entity_effects=True ).fit(cov_type='clustered', cluster_entity=True)
b = pd.DataFrame(fe_mod_bs.params)
print(b.head())
betas = pd.concat([betas, b], axis = 1, join = 'outer')
Unfortunately the bootstrap samples need to be selected by group for the panel regression, so that a complete ID is picked instead of just one row. I could not figure out how to extend the function to create a sample that way. So I basically have two questions:
Does the overall approach make sense for panel regression at all?
How do I adjust the bootstrapping so that the multilevel / panel structure is taken into account and complete IDs instead of single rows are "picked" during the bootstrapping?
I solved my problem with the following code:
import numpy as np
from tqdm import tqdm

companies = pd.DataFrame(df.reset_index().Company.unique())
betas_summary = pd.DataFrame()
for i in tqdm(range(1, 10001)):
# Creating a bootstrap sample with replacement
bootstrap = companies.sample(n=companies.shape[0], replace=True)
bootstrap.rename(columns={bootstrap.columns[0]: "Company"}, inplace=True)
Period = list(range(1, 25))
list_of_bs_comp = bootstrap.Company.to_list()
multiindex = [list_of_bs_comp, np.array(Period)]
bs_df = pd.MultiIndex.from_product(multiindex, names=['Company', 'Period'])
bs_result = df.loc[bs_df, :]
betas = pd.DataFrame()
# Fit the regression and save beta coefficients
DV_bs = bs_result.y
IV_bs = sm2.add_constant(bs_result[['X1', 'X2', 'X3']])
fe_mod_bs = PanelOLS(DV_bs, IV_bs, entity_effects=True ).fit(cov_type='clustered', cluster_entity=True)
b = pd.DataFrame(fe_mod_bs.params)
b.rename(columns={'parameter':"b"}, inplace=True)
betas = pd.concat([betas, b], axis = 1, join = 'outer')
where Company is my entity variable and Period is my time variable.
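For reference, a more generic way to draw the same kind of cluster bootstrap (my own sketch, not part of the answer above, assuming df is indexed by Company and Period as described; it does not require every Company to have all 24 Periods) is to sample the entity labels with replacement and stack the corresponding blocks:
# sketch: sample whole entities with replacement, then concatenate their full blocks
# note: entities drawn more than once will appear multiple times in the result
entities = df.index.get_level_values("Company").unique().to_series()
drawn = entities.sample(n=len(entities), replace=True)
bootstrap_df = pd.concat(
    [df.xs(ent, level="Company", drop_level=False) for ent in drawn]
)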
I am currently working on a project where I have to translate R code to Python.
I came across an issue with polynomial regression: there is a difference between the coefficients I get from R and from Python.
Here's my data:
import pandas as pd

# assumption: stress_immo starts as an empty DataFrame and the two columns are assigned to it
stress_immo = pd.DataFrame()
stress_immo['stress immo'] = [0.0, -0.2, -0.4]
stress_immo['Choc A - EQ T1'] = [-0.021951, -0.021951, -0.021951]
The code given to me in R is the following:
Reg_GF_S_cEQT1_A_a_RE <- lm(Choc.A...EQ.T1~stress.immo+ I(stress.immo^2), data=stress_immo)
The result of this is:
(Intercept) -2.195e-02 NA NA NA
stress.immo -9.014e-17 NA NA NA
I(stress.immo^2) -1.502e-16 NA NA NA
Here's my code in Python (very likely to be wrong):
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
x = (stress_immo['stress immo'].values).reshape(-1,1)
qfit = PolynomialFeatures(degree=2)
xq = qfit.fit_transform(x)
y = (stress_immo['Choc A - EQ T1'].values).reshape(-1,1)
qr = LinearRegression()
model = qr.fit(xq,y)
and here are my results:
print(model.coef_)
[[0. 0. 0.]]
print(model.intercept_)
[-0.02195108]
As you can see, the intercept is correct but the coefficients are always 0 (no matter what data I choose). I also tried doing a linear regression like so:
import numpy as np

x = stress_immo['stress immo'].values
x2 = np.power(stress_immo['stress immo'].values,2)
vector_row = np.array([x,x2]).reshape(-1, 2)
y = stress_immo['Choc A - EQ T1'].values
model = LinearRegression().fit(vector_row,y)
but the result is always the same: 0 coefficients.
I would be grateful if someone could help. Thanks!
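For what it's worth, a quick way to cross-check with a third method is numpy.polyfit on the same three rows (a sketch, my own check rather than part of the original R code); since the response column is constant, a quadratic fit would be expected to give the intercept -0.021951 with higher-order coefficients that are numerically zero:
import numpy as np

x = np.array([0.0, -0.2, -0.4])
y = np.array([-0.021951, -0.021951, -0.021951])

# degree-2 fit; np.polyfit returns coefficients from the highest power down to the constant term
print(np.polyfit(x, y, 2))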
My DataFrame:
from random import random, randint
from pandas import DataFrame
t = DataFrame({"metasearch":["A","B","A","B","A","B","A","B"],
"market":["A","B","A","B","A","B","A","B"],
"bid":[random() for i in range(8)],
"clicks": [randint(0,10) for i in range(8)],
"country_code":["A","A","A","A","A","B","A","B"]})
I want to fit a LinearRegression for each market, so I:
1) Group the df: groups = t.groupby(by="market")
2) Prepare a function to fit a model on a group:
from sklearn.linear_model import LinearRegression
def group_fitter(group):
lr = LinearRegression()
X = group["bid"].fillna(0).values.reshape(-1,1)
y = group["clicks"].fillna(0)
lr.fit(X, y)
return lr.coef_[0] # THIS IS A SCALAR
3) Create a new Series with market as an index and coef as a value:
s = groups.transform(group_fitter)
But the 3rd step fails: KeyError: ('bid_cpc', 'occurred at index bid')
I think you need apply instead of transform, because the function works with several columns together; then use join to add the new column:
from sklearn.linear_model import LinearRegression
def group_fitter(group):
lr = LinearRegression()
X = group["bid"].fillna(0).values.reshape(-1,1)
y = group["clicks"].fillna(0)
lr.fit(X, y)
return lr.coef_[0] # THIS IS A SCALAR
groups = t.groupby(by="market")
df = t.join(groups.apply(group_fitter).rename('new'), on='market')
print (df)
bid clicks country_code market metasearch new
0 0.462734 9 A A A -8.632301
1 0.438869 5 A B B 6.690289
2 0.047160 9 A A A -8.632301
3 0.644263 0 A B B 6.690289
4 0.579040 0 A A A -8.632301
5 0.820389 6 B B B 6.690289
6 0.112341 5 A A A -8.632301
7 0.432502 0 B B B 6.690289
Just return the group from the function instead of the coefficient.
# return the group instead of a scalar value
def group_fitter(group):
lr = LinearRegression()
X = group["bid"].fillna(0).values.reshape(-1,1)
y = group["clicks"].fillna(0)
lr.fit(X, y)
group['coefficient'] = lr.coef_[0] # <- This is the changed line
return group
# the new column gets added to the data
s = groups.apply(group_fitter)
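The result s is the original rows with an extra coefficient column (the same value repeated for every row of a market); a quick way to inspect it, with output depending on the random data above:
print(s[['bid', 'clicks', 'coefficient']].head())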