Has anybody tried to reach the same results by implementing ElasticNetCV in Python and cv.glmnet in R?
I have figured out how to match ElasticNet in Python with glmnet in R, but I cannot reproduce the match with the cross-validation methods...
Steps to reproduce in Python:
Preprocessing:
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, ElasticNetCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
import pandas as pd
data = make_regression(
    n_samples=100000,
    random_state=0
)
X, y = data[0], data[1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.25)
pd.DataFrame(X_train).to_csv('X_train.csv', index=None)
pd.DataFrame(X_test).to_csv('X_test.csv', index=None)
pd.DataFrame(y_train).to_csv('y_train.csv', index=None)
pd.DataFrame(y_test).to_csv('y_test.csv', index=None)
Models:
model = ElasticNet(
    alpha=1.0,
    l1_ratio=0.5,
    fit_intercept=True,
    normalize=True,
    precompute=False,
    max_iter=100000,
    copy_X=True,
    tol=1e-7,
    warm_start=False,
    positive=False,
    random_state=0,
    selection='cyclic'
)
model.fit(X=X_train, y=y_train)
y_pred = model.predict(X=X_test)
print(mean_squared_error(y_true=y_test, y_pred=y_pred))
output: 42399.49815189786
model = ElasticNetCV(
    l1_ratio=0.5,
    eps=0.001,
    n_alphas=100,
    alphas=None,
    fit_intercept=True,
    normalize=True,
    precompute=False,
    max_iter=100000,
    tol=1e-7,
    cv=10,
    copy_X=True,
    verbose=0,
    n_jobs=-1,
    positive=False,
    random_state=0,
    selection='cyclic'
)
model.fit(X=X_train, y=y_train)
y_pred = model.predict(X=X_test)
print(mean_squared_error(y_true=y_test, y_pred=y_pred))
output: 39354.729173913176
Steps to reproduce in R:
Preprocessing:
library(glmnet)
X_train <- read.csv(path)
X_test <- read.csv(path)
y_train <- read.csv(path)
y_test <- read.csv(path)
fit <- glmnet(x=as.matrix(X_train), y=as.matrix(y_train))
y_pred <- predict(fit, newx = as.matrix(X_test))
y_error <- y_test - y_pred
mean(as.matrix(y_error)^2)
output: 42399.5
fit <- cv.glmnet(x=as.matrix(X_train), y=as.matrix(y_train))
y_pred <- predict(fit, newx = as.matrix(X_test))
y_error <- y_test - y_pred
mean(as.matrix(y_error)^2)
output: 37.00207
Thanks so much for providing the example. I am on a laptop, so I had to reduce the number of samples to 100:
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, ElasticNetCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
import pandas as pd
data = make_regression(
    n_samples=100,
    random_state=0
)
X, y = data[0], data[1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.25)
When you call predict on a glmnet fit, you need to specify a lambda; otherwise it returns the predictions for all lambdas. So in R:
fit <- glmnet(x=as.matrix(X_train), y=as.matrix(y_train))
y_pred <- predict(fit, newx = as.matrix(X_test))
dim(y_pred)
[1] 25 89
When you run cv.glmnet, it selects the best lambda from cross-validation (predict defaults to lambda.1se), so it gives you only one set of predictions, which yields the MSE you wanted:
fit <- cv.glmnet(x=as.matrix(X_train), y=as.matrix(y_train))
y_pred <- predict(fit, newx = as.matrix(X_test))
y_error <- y_test - y_pred
mean(as.matrix(y_error)^2)
[1] 22.03504
dim(y_error)
[1] 25 1
fit$lambda.1se
[1] 1.278699
If we select the lambda in the glmnet fit that is closest to the one chosen by cv.glmnet, we get back something in the correct range:
fit <- glmnet(x=as.matrix(X_train), y=as.matrix(y_train))
sel <- which.min(abs(fit$lambda - 1.278699))
y_pred <- predict(fit, newx = as.matrix(X_test))[, sel]
mean(as.matrix((y_test - y_pred)^2))
[1] 20.0775
Before we compare with sklearn, we need to make sure we are testing over the same range of lambdas.
L <- c(0.01, 0.05, 0.1, 0.2, 0.5, 1, 2)
fit <- cv.glmnet(x=as.matrix(X_train), y=as.matrix(y_train),lambda=L)
y_pred <- predict(fit, newx = as.matrix(X_test))
y_error <- y_test - y_pred
mean(as.matrix(y_error)^2)
[1] 0.003065869
So we expect something in the range of 0.003065869. We now run it with the same lambdas. Note the naming: glmnet's lambda is what ElasticNet calls alpha, and glmnet's alpha is in fact your l1_ratio (see the glmnet vignette). Also, the normalize option should be set to False, because (quoting the sklearn docs):
If True, the regressors X will be normalized before regression by
subtracting the mean and dividing by the l2-norm. If you wish to
standardize, please use sklearn.preprocessing.StandardScaler before
calling fit on an estimator with normalize=False.
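For reference, here is a minimal sketch of what the docs suggest instead of normalize=True, scaling explicitly with StandardScaler; this is not used in the run below, just shown for completeness:
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import ElasticNetCV

# A sketch: standardize X up front instead of relying on normalize=True
pipe = make_pipeline(
    StandardScaler(),
    ElasticNetCV(l1_ratio=1, alphas=[0.01, 0.05, 0.1, 0.2, 0.5, 1, 2])
)
pipe.fit(X_train, y_train)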
So we just run it using CV:
model = ElasticNetCV(l1_ratio=1, fit_intercept=True, alphas=[0.01, 0.05, 0.1, 0.2, 0.5, 1, 2])
model.fit(X=X_train, y=y_train)
y_pred = model.predict(X=X_test)
mean_squared_error(y_true=y_test, y_pred=y_pred)
0.0018007824874741929
It's in the same ballpark as the R result.
And if you do it with plain ElasticNet, you will get the same result if you specify the alpha.
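For example, a minimal sketch reusing the alpha_ that the fitted ElasticNetCV selected above:
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error

# A sketch: plug the CV-selected penalty into a plain ElasticNet fit;
# it reproduces the ElasticNetCV predictions (model is the fitted ElasticNetCV above)
enet = ElasticNet(l1_ratio=1, alpha=model.alpha_)
enet.fit(X_train, y_train)
print(mean_squared_error(y_true=y_test, y_pred=enet.predict(X_test)))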
Related
I want to change my code so that instead of this part:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=100, test_size=0.2)
train_data = X_train.copy()
train_data.loc[:, 'target'] = y_train
test_data = X_test.copy()
test_data.loc[:, 'target'] = y_test
data_config = DataConfig(
    target=['target'],  # target should always be a list; multi-targets are only
                        # supported for regression, multi-task classification is
                        # not implemented
    continuous_cols=train_data.columns.tolist(),
    categorical_cols=[],
    normalize_continuous_features=True
)
trainer_config = TrainerConfig(
    auto_lr_find=True,
    batch_size=64,
    max_epochs=10,
)
optimizer_config = {
    'optimizer': 'Adam',
    'optimizer_params': {'weight_decay': 0, 'amsgrad': False},
    'lr_scheduler': None,
    'lr_scheduler_params': {},
    'lr_scheduler_monitor_metric': 'valid_loss'
}
model_config = NodeConfig(
    task="classification",
    num_layers=2,
    num_trees=512,
    learning_rate=1,
    embed_categorical=True,
)
tabular_model = TabularModel(
    data_config=data_config,
    model_config=model_config,
    optimizer_config=optimizer_config,
    trainer_config=trainer_config,
)
tabular_model.fit(train=train_data, test=test_data)
pred = tabular_model.predict(test_data)
pred['prediction'] = pred['prediction'].astype(int)
pred.loc[pred['prediction'] >= 1, 'prediction'] = 1  # cap predictions at 1
print_metrics(test_data['target'], pred["prediction"].astype('int'), tag="Holdout")
I want to use the k-fold method with k = 5 or 10.
Thank you for your advice.
The complete code example where I used train_test_split is above.
Here is the basic train/test split example from the scikit-learn cross-validation guide (a k-fold sketch follows after the link below):
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn import svm
X, y = datasets.load_iris(return_X_y=True)
X.shape, y.shape
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0)
X_train.shape, y_train.shape
X_test.shape, y_test.shape
clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
clf.score(X_test, y_test)
result (in this example):
0.9666666666666667
The example is from here: https://scikit-learn.org/stable/modules/cross_validation.html
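And here is a minimal sketch of the actual k-fold version (k=5) on the same data, using cross_val_score so each fold is scored for you:
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

X, y = datasets.load_iris(return_X_y=True)
clf = svm.SVC(kernel='linear', C=1)
scores = cross_val_score(clf, X, y, cv=5)  # 5-fold cross-validation
print(scores)         # one accuracy score per fold
print(scores.mean())  # average across the folds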
I am trying to code a multiple linear regression problem using two different methods. One is the simple one as stated below:
from sklearn import linear_model, metrics
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X = df[['geo','age','v_age']]
y = df['freq']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
# Fitting model
regr2 = linear_model.LinearRegression()
regr2.fit(X_train, y_train)
ypred = regr2.predict(X_test)  # prediction on the held-out set
print(metrics.mean_squared_error(ypred, y_test))
print(r2_score(y_test, ypred))
The above code gives me an MSE of 0.46 and an R2 score of 0.0012, which is a really bad fit. Meanwhile, when I use:
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=1) #Degree = 1 should give the same equation as above code block
X_ = poly.fit_transform(X)
y = y.values.reshape(-1, 1)
predict_ = poly.fit_transform(y)
X_train, X_test, y_train, y_test = train_test_split(X_, predict_, test_size=0.33, random_state=42)
# Fitting model
regr2 = linear_model.LinearRegression()
regr2.fit(X_train, y_train)
ypred = regr2.predict(X_test)  # prediction on the held-out set
print(metrics.mean_squared_error(ypred, y_test))
print(r2_score(y_test, ypred))
Using PolynomialFeatures gives me an MSE of 0.23 and an R2 score of 0.5, which is much, much better. I don't understand how two methods using the same regression equation give such different answers; everything else is the same.
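One thing worth checking (a quick sketch with a stand-in array, since df is not shown): PolynomialFeatures(degree=1) prepends a bias column of ones, so poly.fit_transform(y) turns the (n, 1) target into an (n, 2) array with a constant first column, and the metrics in the second block are computed over both columns rather than over the original one-dimensional target:
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

y_demo = np.array([[1.0], [2.0], [3.0]])  # stand-in for y.values.reshape(-1, 1)
poly = PolynomialFeatures(degree=1)
print(poly.fit_transform(y_demo))
# [[1. 1.]
#  [1. 2.]
#  [1. 3.]]  <- a constant bias column has been added to the target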
I have a function with a regression loop built in. I want to assign the R-squared from each iteration to an object that I can print out later.
Here's part of the function (including the regression), trimmed for brevity:
cuts = [stats, stats_po, stats_ic, stats_id, stats_h, stats_a, stats_bos, stats_bkn, stats_nyk, stats_phi, stats_tor, stats_chi, stats_cle, stats_det, stats_ind, stats_mil, stats_den, stats_min, stats_okc, stats_por, stats_uta, stats_gsw, stats_lac, stats_lal, stats_phx, stats_sac, stats_atl, stats_cha, stats_mia, stats_orl, stats_was, stats_dal, stats_hou, stats_mem, stats_nop, stats_sas, stats_o1, stats_o2, stats_d1, stats_d2, stats_l25]
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics

def process_cuts(c):
    c = c.dropna(axis=0, how='all')
    n = c.team.str.rsplit(" ", n=1, expand=True)
    c['city'] = n[0]
    c['team_name'] = n[1]
    c['team_name'] = c['team_name'].str.replace('Trailblazers', 'Blazers')
    c['team_name'] = c['team_name'].str.replace('Bobcats', 'Hornets')
    for z in ['Points','Steals','Blocks','Assists','OReb','DefReb','Turnovers','FieldGoals','ThreeShots','FTP','Fouls','FTMiss','FGMiss','FreeThrows']:
        y = mergered[z]
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
        regressor = LinearRegression()
        regressor.fit(X_train, y_train)
        coeff_df = pd.DataFrame(regressor.coef_, X.columns, columns=['Coefficient'])
        print(coeff_df)
        y_pred = regressor.predict(X_test)
        df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
        rsquared = 'Rsquared:' + ' ' + z, metrics.r2_score(y_test, y_pred)  # overwritten on every pass

cuts_diffs = list(map(process_cuts, cuts))
I want to store the R-squared for each y and print them out for each data cut.
I'd appreciate your help.
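A common pattern (a sketch, assuming the X, mergered, and column list from your snippet) is to collect the scores in a dict inside the loop and return it, so the map call yields one set of scores per cut:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics

def process_cuts(c):
    # ... same cleaning code as above ...
    rsquareds = {}
    for z in ['Points', 'Steals', 'Blocks']:  # etc.
        y = mergered[z]
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, random_state=0)
        regressor = LinearRegression().fit(X_train, y_train)
        y_pred = regressor.predict(X_test)
        rsquareds[z] = metrics.r2_score(y_test, y_pred)  # one entry per target
    return rsquareds

cuts_diffs = list(map(process_cuts, cuts))  # one dict of R-squareds per data cut
for scores in cuts_diffs:
    print(scores)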
I am trying logistic regression classification using k-fold cross-validation in Python.
My code:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import confusion_matrix, roc_auc_score

data = pd.read_csv('xxx.csv')
X = data[["a","b","c",...]]
y = data["Class"]

def get_predictions(clf, X_train, y_train, X_test):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    y_pred_prob = clf.predict_proba(X_test)
    train_pred = clf.predict(X_train)
    print('train-set confusion matrix:\n', confusion_matrix(y_train, train_pred))
    return y_pred, y_pred_prob

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
pred_test_full = 0
cv_score = []
i = 1
for train_index, test_index in skf.split(X, y):
    X_train, y_train = X.loc[train_index], y.loc[train_index]
    X_test, y_test = X.loc[test_index], y.loc[test_index]
    log_cfl = LogisticRegression(C=2)
    log_cfl.fit(X_train, y_train)
    y_pred, y_pred_prob = get_predictions(LogisticRegression(C=2), X_train, y_train, X_test)
    score = roc_auc_score(y_test, log_cfl.predict(X_test))
    print('ROC AUC score: ', score)
    cv_score.append(score)
    pred_test_full = pred_test_full + y_pred_prob
    i += 1
I get an error at this line of code:
pred_test_full = pred_test_full + y_pred_prob
The loop runs twice; then on the third iteration I get the error:
operands could not be broadcast together with shapes (56962,2) (5696...
I can't figure out what is wrong. Could you help?
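For what it's worth, a sketch of one way around it (reusing X, y, and the classifier from the question): StratifiedKFold folds differ in size by one when len(X) is not divisible by n_splits, so summing the per-fold (n_fold, 2) probability arrays eventually fails. Filling a preallocated out-of-fold array by test_index sidesteps the shape mismatch:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
oof_prob = np.zeros((len(X), 2))  # one row per sample, one column per class
for train_index, test_index in skf.split(X, y):
    clf = LogisticRegression(C=2)
    clf.fit(X.loc[train_index], y.loc[train_index])
    oof_prob[test_index] = clf.predict_proba(X.loc[test_index])  # shapes always line up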
I'm using scikit-learn to train some classifiers. I do cross-validation and then compute the AUC. However, I'm getting a different AUC number every time I run the tests, although I made sure to use a seed and a RandomState. I want my tests to be deterministic. Here's my code:
import numpy as np
from numpy import interp
from sklearn import cross_validation, linear_model
from sklearn.metrics import roc_curve, auc
from sklearn.utils import shuffle

SEED = 0
random_state = np.random.RandomState(SEED)
X, y = shuffle(data, Y, random_state=random_state)
X_train, X_test, y_train, y_test = \
    cross_validation.train_test_split(X, y, test_size=test_size, random_state=random_state)
clf = linear_model.LogisticRegression()
kfold = cross_validation.KFold(len(X), n_folds=n_folds)
mean_tpr = 0.0
mean_fpr = np.linspace(0, 1, 100)
for train, test in kfold:
    probas_ = clf.fit(X[train], y[train]).predict_proba(X[test])
    fpr, tpr, thresholds = roc_curve(y[test], probas_[:, 1])
    mean_tpr += interp(mean_fpr, fpr, tpr)
    mean_tpr[0] = 0.0
mean_tpr /= len(kfold)
mean_tpr[-1] = 1.0
mean_auc = auc(mean_fpr, mean_tpr)
My questions:
1- Is there something wrong in my code that's making the results different each time I run it?
2- Is there a global way to make scikit deterministic?
EDIT:
I just tried this:
test_size = 0.5
X = np.random.randint(10, size=(10,2))
Y = np.random.randint(2, size=(10))
SEED = 0
random_state = np.random.RandomState(SEED)
X_train, X_test, y_train, y_test = \
cross_validation.train_test_split(X, Y, test_size=test_size, random_state=random_state)
print X_train # I recorded the result
Then I did:
X_train, X_test, y_train, y_test = \
cross_validation.train_test_split(X, Y, test_size=test_size, random_state=6) #notice the change in random_state
Then I did:
X_train, X_test, y_train, y_test = \
cross_validation.train_test_split(X, Y, test_size=test_size, random_state=random_state)
print X_train  # the result is different from the first one!
As you can see, I'm getting different results although I used the same random_state! How can I solve this?
LogisticRegression uses randomness internally and has an (undocumented, will fix in a moment) random_state argument.
There's no global way of setting the random state, because unfortunately the random state on LogisticRegression and the SVM code can only be set in a hacky way. That's because this code comes from Liblinear and LibSVM, which use the C standard library's rand function and that cannot be seeded in a principled way.
EDIT: The above is true, but probably not the cause of the problem here. You're threading a single np.random.RandomState instance through your calls, whereas you should pass the same integer seed for easy reproducibility.
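A quick sketch of the difference (using the modern sklearn.model_selection import): a RandomState instance is mutated by every call that consumes it, so two calls given the same instance see different points in the stream, while an integer seed is expanded to a fresh state on each call:
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)
y = np.arange(10)

rs = np.random.RandomState(0)
a, _, _, _ = train_test_split(X, y, test_size=0.5, random_state=rs)
b, _, _, _ = train_test_split(X, y, test_size=0.5, random_state=rs)
print((a == b).all())  # typically False: the instance's internal state moved on

c, _, _, _ = train_test_split(X, y, test_size=0.5, random_state=0)
d, _, _, _ = train_test_split(X, y, test_size=0.5, random_state=0)
print((c == d).all())  # True: the same integer seed replays the same split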