train_test_split affects the result even when the value to predict is the same - python

I'm new to data science and I have a question about train_test_split.
I have an example that tries to predict iced tea sales from temperature.
My question: when I use train_test_split, my MSE, score and predicted sales value are different every time (since train_test_split selects a different subset each time).
Is this normal? If a user enters the same value of 30 degrees every time, will they get a different predicted sales value each run?
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
#1. predict value
temperature = np.reshape(np.array([30]), (1, 1))
#2. data
X = np.array([29, 28, 34, 31, 25, 29, 32, 31, 24, 33, 25, 31, 26, 30]) #temperatures
y = np.array([77, 62, 93, 84, 59, 64, 80, 75, 58, 91, 51, 73, 65, 84]) #iced_tea_sales
X = np.reshape(X, (len(X), 1))
y = np.reshape(y, (len(y), 1))
#3. split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
#4. train
lm = LinearRegression()
lm.fit(X_train, y_train)
#5. mse score
y_pred = lm.predict(X_test)
mse = np.mean((y_pred - y_test) ** 2)
r_squared = lm.score(X_test, y_test)
print(f'mse: {mse}')
print(f'score(r_squared): {r_squared}')
#6. predict
sales = lm.predict(temperature)
print(sales) #output, user get their prediction

Yes, this is expected. Each call to train_test_split shuffles the data and selects a different training subset, so the model is fitted on different samples every run and the learned coefficients (and therefore the predictions) differ. They should still be close to each other (if you don't have outliers), since every subset comes from the same underlying distribution. If you need reproducible results, fix the split with the random_state parameter.
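For instance, a minimal sketch (using the same toy data as above): passing random_state to train_test_split makes the split, and therefore the fitted model and its prediction, identical on every run.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)  # fixed seed -> same split every run
lm = LinearRegression().fit(X_train, y_train)
print(lm.predict(temperature))  # now returns the same predicted sales value on every run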

Related

Confidence interval for AUC with the bootstrap method

Today I attempted to use the bootstrap to obtain confidence intervals for the AUC of several different ML algorithms.
I used my personal medical dataset with 61 features, formatted like this:
Age    Female
65     1
45     0
For example, I used this type of algorithm:
import numpy as np
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, precision_recall_curve
from sklearn.model_selection import train_test_split

X = data_sevrage.drop(['Echec_sevrage'], axis=1)
y = data_sevrage['Echec_sevrage']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
lr = LogisticRegression(C=10, penalty='l1', solver='saga', max_iter=500).fit(X_train, y_train)
score = roc_auc_score(y_test, lr.predict_proba(X_test)[:, 1])
precision, recall, thresholds = precision_recall_curve(y_test, lr.predict_proba(X_test)[:, 1])
auc_precision_recall = metrics.auc(recall, precision)
y_pred = lr.predict(X_test)
print('ROC AUC score:', score)
print('auc_precision_recall:', auc_precision_recall)
And finally, when I used the bootstrap method to obtain the confidence interval (I took the code from another topic: How to compare ROC AUC scores of different binary classifiers and assess statistical significance in Python?):
def bootstrap_auc(clf, X_train, y_train, X_test, y_test, nsamples=1000):
    auc_values = []
    for b in range(nsamples):
        idx = np.random.randint(X_train.shape[0], size=X_train.shape[0])
        clf.fit(X_train[idx], y_train[idx])
        pred = clf.predict_proba(X_test)[:, 1]
        roc_auc = roc_auc_score(y_test.ravel(), pred.ravel())
        auc_values.append(roc_auc)
    return np.percentile(auc_values, (2.5, 97.5))

bootstrap_auc(lr, X_train, y_train, X_test, y_test, nsamples=1000)
I get this error:
"None of [Int64Index([21, 22, 20, 31, 30, 13, 22, 1, 31, 3, 2, 9, 9, 18, 29, 30, 31,\n 31, 16, 11, 23, 7, 19, 10, 14, 5, 10, 25, 30, 24, 8, 20],\n dtype='int64')] are in the [columns]"
I used this other method, and I got nearly the same error:
n_bootstraps = 1000
rng_seed = 42  # control reproducibility
bootstrapped_scores = []
rng = np.random.RandomState(rng_seed)
for i in range(n_bootstraps):
    # bootstrap by sampling with replacement on the prediction indices
    indices = rng.randint(0, len(y_pred), len(y_pred))
    if len(np.unique(y_test[indices])) < 2:
        # We need at least one positive and one negative sample for ROC AUC
        # to be defined: reject the sample
        continue
    score = roc_auc_score(y_test[indices], y_pred[indices])
    bootstrapped_scores.append(score)
    print("Bootstrap #{} ROC area: {:0.3f}".format(i + 1, score))
'[6, 3, 12, 14, 10, 7, 9] not in index'
Can you help me please? I have tried many solutions, but I get this error every time.
Thank you!
The problem is solved! It was just a format problem: converting the data to NumPy format solved it. Thank you!
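For reference, a minimal sketch of the fix described above (assuming the pandas objects from the question): converting the DataFrames/Series to NumPy arrays makes the integer bootstrap indices select rows by position instead of being looked up as labels.
# hypothetical illustration of the "conversion to numpy format"
X_train_np = X_train.to_numpy()
y_train_np = y_train.to_numpy()
X_test_np = X_test.to_numpy()
y_test_np = y_test.to_numpy()
bootstrap_auc(lr, X_train_np, y_train_np, X_test_np, y_test_np, nsamples=1000)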

Python Scikit-Learn RandomizedSearchCV with custom scoring functions

I am using Scikit-Learn's Random Forest Regressor, Pipeline, and RandomizedSearchCV to predict the target variable using some features in my dataset. I need to use my own custom scoring functions that calculate weighted scores using weights (signifying the importance of observations) from the dataset. My code seems to work but I am getting a warning when the grid runs:
DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for examples using ravel(). self.__final_estimator.fit(Xt, y, **fit_params)
This is related to .fit(X_train, y_train). Based on this warning, if I change the code to .fit(X_train, y_train.values.ravel()), then I cannot get my weighted scores to work. I have tried editing the code in different/appropriate ways to get the weighted scores to work, but to no avail.
I am including my code below; it runs on test data in test.csv. The file has four columns: two feature columns ('x1', 'x2'), a target column ('y') and a weight column ('weight'). The custom scoring functions below are simple functions that calculate a weighted rmse_score and mean_abs_error_score. How can I use .fit(X_train, y_train.values.ravel()) and still compute the scores?
import pandas as pd
import numpy as np
import sklearn.model_selection as skms
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import make_scorer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def rmse_score(y_true, y_pred, weight):
    weight = weight.loc[y_true.index.values]
    rmse = np.sqrt(np.mean(weight * (y_true.values - y_pred) ** 2))
    return rmse

def mean_abs_error_score(y_true, y_pred, weight):
    weight = weight.loc[y_true.index.values]
    mae = np.mean(weight * np.absolute(y_true.values - y_pred))
    return mae

#---- reading data
heart_df = pd.read_csv('data\\test.csv')

#---- splitting into training & testing sets
y = heart_df['y']
X = heart_df[['x1', 'x2']]
X_train, X_test, y_train, y_test = skms.train_test_split(X, y, test_size=0.20)

X_train_weights = heart_df['weight'].loc[X_train.index.values]
params = {"weight": X_train_weights}
my_scorer1 = make_scorer(rmse_score, greater_is_better=False, **params)
my_scorer2 = make_scorer(mean_abs_error_score, greater_is_better=False, **params)

#---- random forest training with hyperparameter tuning
pipe = Pipeline([("scaler", StandardScaler()), ("rfr", RandomForestRegressor())])
random_grid = {"rfr__n_estimators": [10, 100, 500, 1000],
               "rfr__max_depth": [10, 20, 30, 40, 50, None],
               "rfr__max_features": [0.25, 0.50, 0.75],
               "rfr__min_samples_split": [5, 10, 20],
               "rfr__min_samples_leaf": [3, 5, 10],
               "rfr__bootstrap": [True, False]}
rfr_cv = skms.RandomizedSearchCV(pipe,
                                 param_distributions=random_grid,
                                 n_iter=15,
                                 cv=3,
                                 verbose=3,
                                 scoring={'rmse': my_scorer1, 'mae': my_scorer2},
                                 refit='rmse',
                                 random_state=42,
                                 n_jobs=-1)
rfr_cv.fit(X_train, y_train)

best_params = rfr_cv.best_params_
best_score = rfr_cv.best_score_
print(f'best hyperparameters = {best_params}')
print(f'best score = {best_score}')

Why does StandardScaler have different effects with different numbers of features?

I experimented with the breast cancer data from scikit-learn.
Using all features, without StandardScaler:
from sklearn import datasets
from sklearn.linear_model import Perceptron
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

cancer = datasets.load_breast_cancer()
x = cancer.data
y = cancer.target
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
pla = Perceptron().fit(x_train, y_train)
y_pred = pla.predict(x_test)
print(accuracy_score(y_test, y_pred))
Result 1: 0.9473684210526315
Using all features, with StandardScaler:
cancer = datasets.load_breast_cancer()
x = cancer.data
y = cancer.target
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
sc=StandardScaler()
sc.fit(x_train)
x_train=sc.transform(x_train)
x_test=sc.transform(x_test)
pla = Perceptron().fit(x_train, y_train)
y_pred = pla.predict(x_test)
print(accuracy_score(y_test, y_pred))
Result 2: 0.9736842105263158
Using only two features, without StandardScaler:
cancer = datasets.load_breast_cancer()
x = cancer.data[:,[27,22]]
y = cancer.target
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
pla = Perceptron().fit(x_train, y_train)
y_pred = pla.predict(x_test)
print(accuracy_score(y_test, y_pred))
Result 3: 0.37719298245614036
Using only two features, with StandardScaler:
cancer = datasets.load_breast_cancer()
x = cancer.data[:,[27,22]]
y = cancer.target
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
sc=StandardScaler()
sc.fit(x_train)
x_train=sc.transform(x_train)
x_test=sc.transform(x_test)
pla = Perceptron().fit(x_train, y_train)
y_pred = pla.predict(x_test)
print(accuracy_score(y_test, y_pred))
Result 4: 0.9824561403508771
As results 1-4 show, accuracy improves much more with StandardScaler when training with fewer features.
So I am wondering: why does StandardScaler have different effects with different numbers of features?
PS. The two features I chose are columns 27 and 22, the two features most linearly correlated with the target (see the correlation ordering in the answer below).
TL;DR
Don't do feature selection unless you fully understand why you're doing it and how it may help your algorithm learn and generalize better. For a start, please read http://www.feat.engineering/selection.html by Max Kuhn.
Full answer.
I suspect you tried to select the best feature subset and ran into a situation where an [arbitrary] subset performed better than the whole dataset. StandardScaler is beside the point here, because it's considered a standard preprocessing step for your algorithm. So your real question should be: "Why does a subset of features perform better than the full dataset?"
Why is your feature selection arbitrary? Two reasons.
First, nobody has proven that the most linearly correlated features will improve your algorithm [or any other, for that matter]. Second, the best feature subset is generally not the same as the subset of most correlated features.
Let's see this with code.
A feature subset giving the best accuracy
Let's do a brute-force search over small feature subsets.
from itertools import combinations
from tqdm import tqdm

acc_bench = 0.9736842105263158  # accuracy on all features
res = {}
f = x_train.shape[1]
pcpt = Perceptron(n_jobs=-1)
for i in tqdm(range(2, 10)):
    features_list = combinations(range(f), i)
    for features in features_list:
        pcpt.fit(x_train[:, features], y_train)
        preds = pcpt.predict(x_test[:, features])
        acc = accuracy_score(y_test, preds)
        if acc > acc_bench:
            acc_bench = acc
            res["accuracy"] = acc_bench
            res["features"] = features
print(res)
{'accuracy': 1.0, 'features': (0, 15, 22)}
So you see that the features [0, 15, 22] give perfect accuracy on the validation dataset.
Do the best features have anything to do with correlation to the target?
Let's order the features by their degree of linear correlation with the target.
features = pd.DataFrame(cancer.data, columns=cancer.feature_names)
target = pd.DataFrame(cancer.target, columns=['target'])
cancer_data = pd.concat([features, target], axis=1)
features_list = np.argsort(np.abs(cancer_data.corr()['target'])[:-1].values)[::-1]
features_list
array([27, 22,  7, 20,  2, 23,  0,  3,  6, 26,  5, 25, 10, 12, 13, 21, 24,
       28,  1, 17,  4,  8, 29, 15, 16, 19, 14,  9, 11, 18])
You see that the best feature subset found by brute force has nothing to do with correlation.
Can linear correlation explain the accuracy of the Perceptron?
Let's plot the number of features taken from the list above (starting with the 2 most correlated) against the resulting accuracy.
import matplotlib.pyplot as plt

res = dict()
for i in tqdm(range(2, 10)):
    features = features_list[:i]
    pcpt.fit(x_train[:, features], y_train)
    preds = pcpt.predict(x_test[:, features])
    acc = accuracy_score(y_test, preds)
    res[i] = [acc]
pd.DataFrame(res).T.plot.bar()
plt.ylim([.9, 1])
Once again, the linearly correlated features have little to do with perceptron accuracy.
Conclusion.
Don't select features before running any algorithm unless you're perfectly sure what you're doing and what the effects will be. Don't mix up different selection and learning algorithms, because different algorithms have different opinions about what is important and what is not. A feature unimportant for one algorithm may become important for another. This is especially true for linear vs nonlinear algorithms.
If you want to improve the accuracy of your algorithm, do data cleaning or feature engineering instead.

Having an SVR model with a one-dimensional input vector and a two-dimensional output vector

I currently have two datasets. One gives me a real-valued output corresponding to different input numbers, and the other gives me an integer output corresponding to the same input vector. The data looks pretty much like this:
X (input) = 0, 5, 10, 15, 20, 25
Y1 (output 1) = 0.2, 0.4, 0.7, 1.1, 1.5, 1.9
Y2 (output 2) = 45, 47, 51, 60, 90, 100
I have successfully been able to train two distinct SVR models using SVR from sklearn.svm as follows:
from sklearn.svm import SVR
regressor = SVR(kernel = 'rbf')
svr_rbf = SVR(kernel='rbf', C=1e3, gamma=0.1)
regressor.fit(X, Y1)
Y1_rbf = svr_rbf.fit(X, Y1).predict(X)
regressor.fit(X, Y2)
Y2_rbf = svr_rbf.fit(X, Y2).predict(X)
Is there a way for me to have a multidimensional output using SVR, i.e. input vector X and output vector [Y1, Y2]? No specific reason - I just want to reduce the amount of code and make everything concise.
P. S. I looked into this - https://github.com/nwtgck/multi-svr-python, this is not what I need.
A good option would definitely be the sklearn.multioutput module and the regression and classification models it offers.
They basically take a base estimator (SVR in your case) and use it to predict multiple labels. Depending on the actual model, this is achieved in different ways. The MultiOutputRegressor, for instance, fits one independent regressor per target.
Its use would definitely make the code more concise:
import numpy as np
from sklearn.multioutput import MultiOutputRegressor
from sklearn.svm import SVR
X = np.asarray([0, 5, 10, 15, 20, 25]).reshape(-1, 1)
y = np.asarray([[0.2, 45], [0.4, 47], [0.7, 51], [1.1, 60], [1.5, 90], [1.9, 100]])
regressor = MultiOutputRegressor(SVR(kernel='rbf', C=1e3, gamma=0.1))
regressor.fit(X, y)
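As a quick usage check (a sketch on the toy data above), the fitted MultiOutputRegressor returns one column per target, so a single predict call yields both Y1 and Y2:
preds = regressor.predict(X)
print(preds.shape)  # (6, 2): column 0 approximates Y1, column 1 approximates Y2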

How do I create multiple regression models (statsmodels) on subsets of a pandas data frame using a for loop or condition?

I have a dataframe with a variable state that has 51 unique values, and I have to build a model for each state. For some reason I am limited to regression (statsmodels).
Let's say the variable V1 is to be predicted by variables X1, X2, X3.
State runs from 1 to 51 and will be used as the condition to split the dataframe.
How can I automate this task using a for loop?
Assuming you are only concerned with the looping and not with splitting the dataframe into 51 subparts, here is my attempt at your question.
Let's say you define your OLS function as:
from sklearn.metrics import r2_score
from statsmodels.api import OLS

def OLSfunction(y):
    y_train = traindf[y]
    y_test = testdf[y]
    x_train = x_traindf
    x_test = x_testdf
    model = OLS(y_train, x_train)
    result = model.fit()
    print(result.summary())
    pred_OLS = result.predict(x_test)
    print("R2", r2_score(y_test, pred_OLS))

Y_s = ['1', '2', ..., '51']
for y in Y_s:
    OLSfunction(y)
Please note that you will need traindf and testdf appropriately derived for the specific Y you are modelling, and these will have to be passed into OLSfunction correctly; one possible way to build the per-state frames is sketched below.
Since I do not have any view of what your data looks like, I am not getting into the actual splitting/creation of traindf/testdf.
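Purely as an illustration, here is a hypothetical sketch of how per-state train/test frames could be built, assuming a dataframe df with a 'state' column, predictor columns X1, X2, X3 and target V1 (all names are placeholders, not from the question):
from sklearn.model_selection import train_test_split

per_state_splits = {}
for state, group in df.groupby('state'):
    # hold out 30% of each state's rows for testing
    traindf, testdf = train_test_split(group, test_size=0.3, random_state=0)
    per_state_splits[state] = (traindf[['X1', 'X2', 'X3']], testdf[['X1', 'X2', 'X3']],
                               traindf['V1'], testdf['V1'])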
import pandas as pd
import os as os
import numpy as np
import statsmodels.formula.api as sm
First I created a dict to hold the 51 per-state datasets:
d = {}
for x in range(0, 52):
    d[x] = ccf.loc[ccf['state'] == x]
d.keys()
dict_keys([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51])
To check:
d[1].head()
Then I ran the model in a loop, indexing into the dict:
results = {}
for x in range(1, 51):
    results[x] = sm.Logit(d[x].fraudRisk, d[x][names]).fit().summary2()
However, I felt I should also try multiple classifiers from sklearn. First I need to split the data as noted above.
from sklearn.model_selection import train_test_split
# Multiple Models
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
#Model Metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score
lr={}
gnb={}
svc={}
rfc={}
classifier={}
regr_1={}
regr_2={}
import datetime
datetime.datetime.now()
for x in range(1, 51):
    X_train, X_test, y_train, y_test = train_test_split(d[x][names], d[x].fraudRisk, test_size=0.3)
    print(len(X_train))
    print(len(y_test))
    # Create classifiers
    lr[x] = LogisticRegression().fit(X_train, y_train).predict(X_test)
    gnb[x] = GaussianNB().fit(X_train, y_train).predict(X_test)
    svc[x] = LinearSVC(C=1.0).fit(X_train, y_train).predict(X_test)
    rfc[x] = RandomForestClassifier(n_estimators=1).fit(X_train, y_train).predict(X_test)
    classifier[x] = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train).predict(X_test)
    print(datetime.datetime.now())
    print("Accuracy Score for model for state ", x, 'is ')
    print('LogisticRegression', accuracy_score(y_test, lr[x]))
    print('GaussianNB', accuracy_score(y_test, gnb[x]))
    print('LinearSVC', accuracy_score(y_test, svc[x]))
    print('RandomForestClassifier', accuracy_score(y_test, rfc[x]))
    print('KNeighborsClassifier', accuracy_score(y_test, classifier[x]))
    print("Classification Report for model for state ", x, 'is ')
    print('LogisticRegression', classification_report(y_test, lr[x]))
    print('GaussianNB', classification_report(y_test, gnb[x]))
    print('LinearSVC', classification_report(y_test, svc[x]))
    print('RandomForestClassifier', classification_report(y_test, rfc[x]))
    print('KNeighborsClassifier', classification_report(y_test, classifier[x]))
    print("Confusion Matrix Report for model for state ", x, 'is ')
    print('LogisticRegression', confusion_matrix(y_test, lr[x]))
    print('GaussianNB', confusion_matrix(y_test, gnb[x]))
    print('LinearSVC', confusion_matrix(y_test, svc[x]))
    print('RandomForestClassifier', confusion_matrix(y_test, rfc[x]))
    print('KNeighborsClassifier', confusion_matrix(y_test, classifier[x]))
    print("Area Under Curve for model for state ", x, 'is ')
    print('LogisticRegression', roc_auc_score(y_test, lr[x]))
    print('GaussianNB', roc_auc_score(y_test, gnb[x]))
    print('LinearSVC', roc_auc_score(y_test, svc[x]))
    print('RandomForestClassifier', roc_auc_score(y_test, rfc[x]))
    print('KNeighborsClassifier', roc_auc_score(y_test, classifier[x]))
It took a long time for 5 models x 51 states with multiple metrics, but it was worth it. Let me know if there is a faster or better way to write more elegant and less hacky code.
