For Loop In Python using sklearn.model_selection.train_test_split - python

I need to create a FOR loop in Python that will repeat steps 1-2 1,00 times.
1. Split the sample randomly into training and test sets using a 632:368 ratio.
2. Build the model using the 63.2% training data and compute R-squared on the holdout data.
I can't seem to get the R-squared for the holdout data:
y=data['Amount']
xall = data
xall.drop(["No","Amount", "Class"], axis = 1, inplace = True)
for seed in range(10_00):
    X_train, X_test, y_train, y_test = train_test_split(xall, y,
                                                        test_size=0.382,
                                                        random_state=seed)
    modelall = LinearRegression()
    modelall.fit(xall, y)
    modelall = LinearRegression().fit(xall, y)
    r_sq = modelall.score(xall, y)
    print('coefficient of determination:', r_sq)

Fit the model using the TRAINING data and estimate the score using the TEST data.
Use this:
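from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

y = data['Amount']
# drop() without inplace returns a copy, so `data` itself is left unchanged
xall = data.drop(["No", "Amount", "Class"], axis=1)
for seed in range(100):
    # test_size=0.368 matches the 632:368 ratio; each seed gives a different split
    X_train, X_test, y_train, y_test = train_test_split(xall, y, test_size=0.368, random_state=seed)
    modelall = LinearRegression()
    modelall.fit(X_train, y_train)         # fit on the training split only
    r_sq = modelall.score(X_test, y_test)  # R-squared on the holdout split
    print('coefficient of determination:', r_sq)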

You are fitting the linear model to the whole dataset (xall) on every iteration, so the split produced by each seed is never used. Ordinary least squares is deterministic, so fitting on the same data gives the same output irrespective of the seed value.
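If the goal is to summarize the holdout R-squared over many random splits rather than print each one, the scores can be collected in a list. Below is a minimal sketch on synthetic data; the array shapes, the coefficients and the name n_repeats are assumptions made for illustration, not part of the original question.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption: 200 rows, 3 features).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

n_repeats = 100  # hypothetical repeat count; set it to whatever the task requires
scores = []
for seed in range(n_repeats):
    # A new seed produces a new random 632:368 split.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.368, random_state=seed
    )
    model = LinearRegression().fit(X_train, y_train)  # train on 63.2%
    scores.append(model.score(X_test, y_test))        # R-squared on the holdout 36.8%

scores = np.array(scores)
print("mean holdout R^2:", scores.mean())
print("std of holdout R^2:", scores.std())
```

Because the model is fitted on a different training subset each time, the scores vary from seed to seed, which is exactly the spread the repeated split is meant to reveal.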

Related

Why does my PCA change every time I run the code in python?

I imputed my dataframe of any missing values with the median of each feature and scaled using StandardScaler(). I ran regular kneighbors with n=3 and the accuracy stays consistent.
Now I need to run PCA on the resulting dataset with n_components=4 and apply K-neighbors with 3 neighbors. However, every time I run the program, the PCA dataset and the kneighbors accuracy change, even though the master dataset itself doesn't. I even tried using only the first 4 features of the dataset when applying kneighbors, and even that is inconsistent.
data = pd.read_csv('dataset.csv')
y = merged['Life expectancy at birth (years)']
X_train, X_test, y_train, y_test = train_test_split(data,
                                                    y,
                                                    train_size=0.7,
                                                    test_size=0.3,
                                                    random_state=200)
for i in range(len(features)):
    featuredata = X_train.iloc[:,i]
    fulldata = data.iloc[:,i]
    fulldata.fillna(featuredata.median(), inplace=True)
    data.iloc[:,i] = fulldata
scaler = preprocessing.StandardScaler().fit(X_train)
data = scaler.transform(data)
If I apply KNeighbors here, it runs fine, and my accuracy score remains the same.
pcatest = PCA(n_components=4)
pca_data = pcatest.fit_transform(data)
X_train, X_test, y_train, y_test = train_test_split(pca_data,
                                                    y,
                                                    train_size=0.7,
                                                    test_size=0.3)
pca = neighbors.KNeighborsClassifier(n_neighbors=3)
pca.fit(X_train, y_train)
y_pred_pca = pca.predict(X_test)
pca_accuracy = accuracy_score(y_test, y_pred_pca)
However, my pca_accuracy score changes every time I run the code. What can I do to make it set and consistent?
first4_data = data[:,:4]
X_train, X_test, y_train, y_test = train_test_split(first4_data,
                                                    y,
                                                    train_size=0.7,
                                                    test_size=0.3)
first4 = neighbors.KNeighborsClassifier(n_neighbors=3)
first4.fit(X_train, y_train)
y_pred_first4 = first4.predict(X_test)
first4_accuracy = accuracy_score(y_test, y_pred_first4)
I am only taking the first 4 features/columns and the data should remain the same, but for some reason the accuracy score changes every time I run it.
You need to pass a random_state value to train_test_split; without it, every run produces a different split and therefore a different result. Each call shuffles the data before splitting, and that shuffle is only reproducible when random_state is fixed. It's the equivalent of set.seed() in R.
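To see the effect in isolation, here is a small sketch on a toy array. The helper name test_rows and the toy data are made up for the illustration. With a fixed integer random_state the same rows land in the test set on every call; with random_state left unset the selection is usually different each time.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: the first column doubles as a row identifier.
X = np.arange(40).reshape(20, 2)
y = np.arange(20) % 2

def test_rows(random_state=None):
    """Return the identifiers of the rows that end up in the test set."""
    _, X_test, _, _ = train_test_split(X, y, test_size=0.3, random_state=random_state)
    return X_test[:, 0].tolist()

# Fixed seed: the split is identical on every call.
print(test_rows(200) == test_rows(200))  # True: same seed, same split

# No seed: each call reshuffles, so the two lists will usually differ.
print(test_rows(), test_rows())
```

In the question above only the first train_test_split call sets random_state=200; the two later calls would need the same argument for the accuracy to stay fixed between runs.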

How to implement Gaussian Naive Bayes in two training sets

How am I supposed to implement Gaussian Naive Bayes with a training set and a separate test set?
I need to:
Create a training set by selecting the rows with id <= 160.
Train a Gaussian Naive Bayes classifier, as we saw in class, to determine whether a campaign will be successful, given the amounts spent on each marketing channel.
Calculate the fraction of the training set that is correctly classified.
and:
Create a test set by selecting the rows with id > 160.
Evaluate the performance of the classifier as follows:
What percentage of the test set was classified correctly (correct answers out of the total)? It is desirable that this number reaches at least 80%.
What is the ratio of false positives to false negatives?
Successful marketing campaign:
successful_marketing_campaign = (dataset['sales'] > 15) | (dataset['total_invested'] < 20)
And my code:
X = dataset.iloc[:, [0, 3]].values.astype('int')
y = dataset.iloc[:, [4]].values.astype('int')
X_train = dataset.iloc[0:160, [0, 3]].values.astype('int')
y_train = dataset.iloc[0:160, 4].values.astype('int')
X_test = dataset.iloc[160:, [0, 3]].values.astype('int')
y_test = dataset.iloc[160:, 4].values.astype('int')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = GaussianNB()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
matrix = confusion_matrix(y_test, y_pred)
print(matrix)
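One thing that stands out in the code above is that the train_test_split call overwrites the id-based split made just before it. Below is a minimal sketch of the id-based version on synthetic data. The column names channel_a, channel_b and success, and the way sales is generated, are assumptions made only so the example runs; the real dataset will differ.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the real dataset (all columns are hypothetical).
rng = np.random.default_rng(0)
n = 200
dataset = pd.DataFrame({
    "id": np.arange(1, n + 1),
    "channel_a": rng.uniform(0, 100, n),
    "channel_b": rng.uniform(0, 50, n),
})
dataset["total_invested"] = dataset["channel_a"] + dataset["channel_b"]
dataset["sales"] = 0.2 * dataset["total_invested"] + rng.normal(0, 2, n)
# Label rule taken from the question.
dataset["success"] = (
    (dataset["sales"] > 15) | (dataset["total_invested"] < 20)
).astype(int)

features = ["channel_a", "channel_b"]
train = dataset[dataset["id"] <= 160]  # training set: id <= 160
test = dataset[dataset["id"] > 160]    # test set: id > 160

clf = GaussianNB().fit(train[features], train["success"])

# Fraction of each set that is classified correctly.
train_acc = accuracy_score(train["success"], clf.predict(train[features]))
y_pred = clf.predict(test[features])
test_acc = accuracy_score(test["success"], y_pred)

# labels=[0, 1] keeps the matrix 2x2 even if one class is missing.
tn, fp, fn, tp = confusion_matrix(test["success"], y_pred, labels=[0, 1]).ravel()
fp_fn_ratio = fp / fn if fn else float("nan")  # undefined when there are no false negatives

print("train accuracy:", train_acc)
print("test accuracy:", test_acc)
print("false positives / false negatives:", fp_fn_ratio)
```

No random split is involved here, so the result does not depend on any seed.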

Obtain errors of individual data points when using cross-validation (scikit-learn)

I am using cross-validation to evaluate my ML models but now I want to look into the distribution of the errors, i.e. I want to get the average error of specific data points whenever they are in the test set.
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import KFold, cross_val_score
X = ...  # data points
y = ...  # output
lm = linear_model.LinearRegression()
kfold = KFold(n_splits=10)
scores = cross_val_score(lm, X, y, scoring='neg_mean_squared_error', cv=kfold)
rmse_scores = [np.sqrt(abs(s)) for s in scores]
print('Testing RMSE (lin reg): {:.3f}'.format(np.mean(rmse_scores)))
Is there an easy way to get the individual errors of each of the data points whenever they are in the test set (not training error) using cross-validation with scikit-learn?
Thank you!
If I understood your question correctly, this should be what you are looking for.
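import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

model = LinearRegression()  # any scikit-learn regressor works here
kf = KFold(n_splits=3)
error = []
# X and y are assumed to be NumPy arrays (use .iloc[...] for pandas objects)
for train_index, val_index in kf.split(X, y):
    Xtrain, X_val = X[train_index], X[val_index]
    ytrain, y_val = y[train_index], y[val_index]
    model.fit(Xtrain, ytrain)
    pred = model.predict(X_val)
    current_error = mean_squared_error(y_val, pred)  # error per iteration (one value per fold)
    error.append(current_error)
print(np.mean(error))  # get mean error after CV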
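The loop above yields one error per fold. If what you want is one error per data point, a possible alternative is cross_val_predict, which returns the prediction each sample received while it was in the test fold. This is a sketch on synthetic data (the shapes and coefficients are assumptions), not necessarily what the asker had in mind.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_predict

# Synthetic stand-in for the real data (assumption: 50 rows, 2 features).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.3, size=50)

kfold = KFold(n_splits=10)
# Each entry is the prediction made for that sample by the model
# that did NOT see it during training.
y_out_of_fold = cross_val_predict(LinearRegression(), X, y, cv=kfold)

squared_errors = (y - y_out_of_fold) ** 2   # one test error per data point
rmse = np.sqrt(squared_errors.mean())       # overall out-of-fold RMSE

# Indices of the five points with the largest test error.
worst = np.argsort(squared_errors)[-5:][::-1]
print("out-of-fold RMSE: {:.3f}".format(rmse))
print("hardest points:", worst)
```

With a plain KFold every point is in the test set exactly once, so there is a single error per point. To average over several repetitions you would repeat this with differently shuffled KFold objects and average squared_errors across runs.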

Why two different AUC scores are produced when evaluated on same data and same algorithm

I am working on a classification problem whose evaluation metric is ROC AUC. So far I have tried using xgb with different parameters. Here is the function which I used to sample the data. You can find the relevant notebook here (Google Colab).
def get_data(x_train, y_train, shuffle=False):
    if shuffle:
        total_train = pd.concat([x_train, y_train], axis=1)
        # generate n random number in range(0, len(data))
        n = np.random.randint(0, len(total_train), size=len(total_train))
        x_train = total_train.iloc[n]
        y_train = total_train.iloc[n]['is_pass']
        x_train.drop('is_pass', axis=1, inplace=True)
        # keep the first 1000 rows as test data
        x_test = x_train.iloc[:1000]
        # keep the 1000 to 10000 rows as validation data
        x_valid = x_train.iloc[1000:10000]
        x_train = x_train.iloc[10000:]
        y_test = y_train[:1000]
        y_valid = y_train[1000:10000]
        y_train = y_train.iloc[10000:]
        return x_train, x_valid, x_test, y_train, y_valid, y_test
    else:
        # keep the first 1000 rows as test data
        x_test = x_train.iloc[:1000]
        # keep the 1000 to 10000 rows as validation data
        x_valid = x_train.iloc[1000:10000]
        x_train = x_train.iloc[10000:]
        y_test = y_train[:1000]
        y_valid = y_train[1000:10000]
        y_train = y_train.iloc[10000:]
        return x_train, x_valid, x_test, y_train, y_valid, y_test
Here are the two outputs that I get after running on shuffled and non shuffled data
AUC with shuffling: 0.9021756235738453
AUC without shuffling: 0.8025162142685565
Can you find out what's the issue here ?
The problem is in your implementation of shuffling: np.random.randint generates random integers that can repeat, so the same rows end up in both your training set and your test and validation sets. You should use np.random.permutation instead (and consider using np.random.seed to make the outcome reproducible).
Another note: there is a very large difference in performance between the training set and the validation/test sets (training shows an almost perfect ROC AUC). I guess this is because the maximum tree depth you allow (14) is too high for the size of the dataset you have (~60K).
P.S. Thanks for sharing the Colaboratory link. I was not aware of it, but it is very useful.
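The difference between the two index generators is easy to check directly. The sketch below uses a hypothetical row count of 1000 and treats the first 100 positions as "test" rows, purely for illustration.

```python
import numpy as np

rng = np.random.RandomState(0)  # seeded so the run is reproducible
n_rows = 1000                   # hypothetical dataset size

# Sampling WITH replacement: indices can repeat.
sampled = rng.randint(0, n_rows, size=n_rows)
# A permutation: every index appears exactly once.
shuffled = rng.permutation(n_rows)

n_unique_sampled = len(np.unique(sampled))
n_unique_shuffled = len(np.unique(shuffled))

def leaked_rows(index, n_test=100):
    """Count rows that appear in both the first n_test positions and the rest."""
    return len(set(index[:n_test]) & set(index[n_test:]))

print("unique rows with randint:", n_unique_sampled)
print("unique rows with permutation:", n_unique_shuffled)
print("rows shared by test and train (randint):", leaked_rows(sampled))
print("rows shared by test and train (permutation):", leaked_rows(shuffled))
```

With randint some rows are drawn several times and others never, so test rows leak into the training part, which would inflate the shuffled AUC. With permutation the two parts are disjoint.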

Making scikit deterministic?

I'm using scikit-learn to train some classifiers. I do cross validation and then compute AUC. However I'm getting a different AUC number every time I run the tests although I made sure to use a seed and a RandomState. I want my tests to be deterministic. Here's my code:
from sklearn.utils import shuffle
SEED = 0
random_state = np.random.RandomState(SEED)
X, y = shuffle(data, Y, random_state=random_state)
X_train, X_test, y_train, y_test = \
    cross_validation.train_test_split(X, y, test_size=test_size, random_state=random_state)
clf = linear_model.LogisticRegression()
kfold = cross_validation.KFold(len(X), n_folds=n_folds)
mean_tpr = 0.0
mean_fpr = np.linspace(0, 1, 100)
for train, test in kfold:
    probas_ = clf.fit(X[train], Y[train]).predict_proba(X[test])
    fpr, tpr, thresholds = roc_curve(Y[test], probas_[:, 1])
    mean_tpr += interp(mean_fpr, fpr, tpr)
    mean_tpr[0] = 0.0
mean_tpr /= len(kfold)
mean_tpr[-1] = 1.0
mean_auc = auc(mean_fpr, mean_tpr)
My questions:
1- Is there something wrong in my code that's making the results different each time I run it?
2- Is there a global way to make scikit deterministic?
EDIT:
I just tried this:
test_size = 0.5
X = np.random.randint(10, size=(10,2))
Y = np.random.randint(2, size=(10))
SEED = 0
random_state = np.random.RandomState(SEED)
X_train, X_test, y_train, y_test = \
    cross_validation.train_test_split(X, Y, test_size=test_size, random_state=random_state)
print X_train # I recorded the result
Then I did:
X_train, X_test, y_train, y_test = \
    cross_validation.train_test_split(X, Y, test_size=test_size, random_state=6) #notice the change in random_state
Then I did:
X_train, X_test, y_train, y_test = \
    cross_validation.train_test_split(X, Y, test_size=test_size, random_state=random_state)
print X_train #the result is different from the first one!!!!
As you see I'm getting different results although I used the same random_state! How to solve this?
LogisticRegression uses randomness internally and has an (undocumented, will fix in a moment) random_state argument.
There's no global way of setting the random state, because unfortunately the random state on LogisticRegression and the SVM code can only be set in a hacky way. That's because this code comes from Liblinear and LibSVM, which use the C standard library's rand function and that cannot be seeded in a principled way.
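EDIT: The above is true, but probably not the cause of the problem. You're threading a single np.random.RandomState through your calls. Every call draws from that object and advances its internal state, so a later call with the same object starts from a different state and produces a different split. Pass the same integer seed to each call instead for easy reproducibility.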
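Here is a small sketch of that behaviour. It uses the current sklearn.model_selection module rather than the older cross_validation module from the question, and the toy arrays and the helper train_rows are made up for the illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: the first column doubles as a row identifier.
X = np.arange(200).reshape(100, 2)
Y = np.arange(100) % 2

def train_rows(random_state):
    """Return the identifiers of the rows that end up in the training set."""
    X_train, _, _, _ = train_test_split(X, Y, test_size=0.5, random_state=random_state)
    return X_train[:, 0].tolist()

# One RandomState object reused across calls: its state advances each time.
shared = np.random.RandomState(0)
first = train_rows(shared)
second = train_rows(shared)
print("same split with a shared RandomState:", first == second)

# An integer seed creates a fresh generator inside every call.
print("same split with an integer seed:", train_rows(0) == train_rows(0))

# The first use of RandomState(0) matches the integer seed 0.
print("first shared call equals seed 0:", first == train_rows(0))
```

So the object is only "the same seed" on its first use. After that, each call continues the random stream where the previous one stopped.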
