I'm trying to implement the use of KNearestNeighbors on the MNIST example dataset.
When trying to use cross_val_predict the script just continues to run no matter how long I leave it for.
Is there something I am missing/doing wrong?
Any feedback is appreciated.
from sklearn.datasets import fetch_openml
import numpy as np
mnist = fetch_openml('mnist_784', version=1) #Imports the dataset into the notebook
X, y = mnist["data"], mnist["target"]
y=y.astype(np.uint8)
X=X.astype(np.uint8)#For machine learning models to understand the output must be casted to an interger not a string.
X.shape, y.shape
y=y.astype(np.uint8) #For machine learning models to understand the output must be casted to an interger not a string.
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:] #Separate the data into training and testing sets
from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_train)
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import f1_score
y_train_knn_pred = cross_val_predict(knn_clf, X_train, y_train, cv=3)
f1_score(y_train, y_train_knn_pred, average="macro")
Use the n_jobs=-1
The number of CPUs to use to do the computation. None means 1 unless
in a joblib.parallel_backend context. -1 means using all processors
from sklearn.datasets import fetch_openml
import numpy as np
mnist = fetch_openml('mnist_784', version=1) #Imports the dataset into the notebook
X, y = mnist["data"], mnist["target"]
y=y.astype(np.uint8)
X=X.astype(np.uint8)#For machine learning models to understand the output must be casted to an interger not a string.
y=y.astype(np.uint8) #For machine learning models to understand the output must be casted to an interger not a string.
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:] #Separate the data into training and testing sets
from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier(n_jobs=-1) # HERE
knn_clf.fit(X_train, y_train) # this took seconds on my macbook pro
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import f1_score
y_train_knn_pred = cross_val_predict(knn_clf, X_train, y_train, cv=3, n_jobs=-1) # AND HERE
f1_score(y_train, y_train_knn_pred, average="macro")
I think the confusion comes from the point that the KNN algorithm fit call is so much faster than the prediction. From another SO post:
Why is cross_val_predict so much slower than fit for KNeighborsClassifier?
KNN is also called as lazy algorithm because during fitting it does
nothing but saves the input data, specifically there is no learning at
all.
During predict is the actual distance calculation happens for each
test datapoint. Hence, you could understand that when using
cross_val_predict, KNN has to predict on the validation data points,
which makes the computation time higher!
Therefore a lot of computation power is needed when you look at the size of your input. data. Using multiple cpus or minimizing dimension could be useful.
If you want to use multiple CPU cores you can pass the argument "n_jobs" to cross_val_predict and to KNeighborsClassifierto set the amount of cores to be used. Set it to -1 to use all cores available
Related
I am using smote to balanced the output (y) only for Model train but want to test the model with original data as it makes logic how we can test the model with smote created outputs. Please ask anything for clarification if I didn't explained it well. It's my starting on Stack overflow.
from imblearn.over_sampling import SMOTE
oversample = SMOTE()
X_sm, y_sm = oversample.fit_resample(X, y)
# Splitting Dataset into Train and Test (Smote)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_sm, y_sm,test_size=0.2,random_state=42)
Here i applied the Random Forest Classifier on my data
import math
from sklearn.metrics import accuracy_score, confusion_matrix
import seaborn as sn
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
# RF = RandomForestClassifier(n_estimators=100)
# RF.fit(X_train, y_train.values.ravel())
# y_pred = RF.predict(X)
# print(metrics.classification_report(y,y_pred))
RF = RandomForestClassifier(n_estimators=10)
RF.fit(X_train, y_train.values.ravel())
If i applied this but X also contains the data which we used for train. how we can remove the data which we already used for training the data.
y_pred = RF.predict(X)
print(metrics.classification_report(y,y_pred))
I used SMOTE in the past, it is suboptimal. Lately, researchers have proven some flaws in the generated distribution of Synthetic Minority Oversample Technique (SMOTE). I know sometimes we don't have a choice regarding the unbalanced classes, but you can use sklearn.ensemble.RandomForestClassifier, where you can define a proper class_weight to handle the unbalanced class problem.
Check scikit-learn documentation:
Scikit-documentation
I agree with razimbres about using class_weight.
Another option for you would be to split the dataset into train and test first. Then, keep the test set aside. Use only the training set from here on:
X_sm, y_sm = oversample.fit_resample(X_train, y_train)
.
.
.
I'm currently using cross_val_score and KFold to assess the impact of using StandardScaler at different points within data pre-processing, specifically whether scaling the entire training dataset prior to performing cross validation introduces data leakage and what the effect of this is when compared to scaling the data from within a Pipeline (and therefore only applying it to the training folds).
my current process is as follows:
Experiment A
Import the boston housing dataset from sklearn.datasets and split into Data (X) and target (y)
create a Pipeline (sklearn.pipeline), that applies StandardScaler before applying linear regression
Specify the cross validation method as KFold with 5 folds
Perform cross validation (cross_val_score) using the above Pipeline and KFold method and observe the score
Experiment B
Use the same boston housing data as above
fit_transform StandardScaler on the entire dataset
Use cross_val_Score to perform cross validation on again 5 folds but this time input LinearRegression directly rather than a pipeline
Compare the scores here to Experiment A
The scores obtained are identical (to around 13 decimal places) which I question as surely Experiment B introduces Data Leakage during cross validation.
I've seen posts stating that it doesnt matter whether scaling is done on the entire training set before cross validation, if this is true I'm looking to understand why, if this isn't true I'd like to understand why the scores can still be so similar despite the data leakage?
See my test code below:
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.model_selection import KFold, StratifiedKFold
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.linear_model import LinearRegression
np.set_printoptions(15)
boston = datasets.load_boston()
X = boston["data"]
y = boston["target"]
scalar = StandardScaler()
clf = LinearRegression()
class StScaler(StandardScaler):
def fit_transform(self,X,y=None):
print('Length of Data on which scaler is fit on =', len(X))
output = super().fit(X,y)
# print('mean of scalar =',output.mean_)
output = super().transform(X)
return output
pipeline = Pipeline([('sc', StScaler()), ('estimator', clf)])
cv = KFold(n_splits=5, random_state=42)
cross_val_score(pipeline, X, y, cv = cv)
# Now fitting Scaler on whole train data
scaler_2 = StandardScaler()
clf_2 = LinearRegression()
X_ss = scaler_2.fit_transform(X)
cross_val_score(clf_2, X_ss, y, cv=cv)
Thanks!
i have more than 2000 data sets for ANN. I have applied MLPRegressor in it. My code is working fine. But for testing, i want to fix my testing value for instance i have 50 data sets. From that i want to test first 20 value. How do I fix this in the code? I have used the following code.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.neural_network import MLPRegressor
df = pd.read_csv("0.5-1.csv")
df.head()
X = df[['wavelength', 'phase velocity']]
y = df['shear wave velocity']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2)
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import mean_absolute_error
mlp = MLPRegressor(hidden_layer_sizes=(30,30,30))
mlp.fit(X_train,y_train)
If you want this for reproducible results, you can pass train_test_split a fix random seed so that in each run, same train/test samples are used. The benefit if using train_test_split would be to choose train/test splits nice and clean with no further effort.
But if you insist on manually choosing train/test split like you said, you can do it this way:
X_test, y_test = X[:20], y[:20] # first 20 samples for test
X_train, y_train = X[20:], y[20:] # rest of samples for train
fix the random seed for numpy as 48 or something else
np.random.seed(48)
this will generate identical splits every time. And use testsize for fixing the size of the split
I am trying to fine tune my sklearn models using train_test_split strategy. I am aware of GridSearchCV's ability to perform parameter tuning, however, it was tied to using Cross Validation strategy, I would like to use train_test_split strategy for the parameter searching, for the speed of training is important for my case, I prefer simple train_test_split over cross-validation.
I could try to write my own for loop, but it would be inefficient for not taking advantage of the built-in parallelization used in GridSearchCV.
Anyone knows how to take advantage GridSearchCV for this? Or provide an alternative that wasn't too slow.
Yes, you can use ShuffleSplit for this.
ShuffleSplit is a cross validation strategy like KFold, but unlike KFold where you have to train K models, here you can control how many times to do the train/test split, even once if you prefer.
shuffle_split = ShuffleSplit(n_splits=1,test_size=.25)
n_splits defines how many times to repeat this splitting and training routine.
Now you can use it like this:
GridSearchCV(clf,param_grid={},cv=shuffle_split)
I would like to add on to Shihab Shahriar's answer, by providing a code sample.
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, ShuffleSplit
from sklearn.ensemble import RandomForestClassifier
# Load iris dataset
iris = datasets.load_iris()
# Prepare X and y as dataframe
X = pd.DataFrame(data=iris.data, columns=iris.feature_names)
y = pd.DataFrame(data=iris.target, columns=['Species'])
# Train test split
shuffle_split = ShuffleSplit(n_splits=1, test_size=0.3)
# This is equivalent to:
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
# But, it is usable for GridSearchCV
# GridSearch without CV
params = { 'n_estimators': [16, 32] }
clf = RandomForestClassifier()
grid_search = GridSearchCV(clf, param_grid=params, cv=shuffle_split)
grid_search.fit(X, y)
This should help anyone facing a similar problem.
I'm working on an example of applying Restricted Boltzmann Machine on Iris dataset. Essentially, I'm trying to make a comparison between RMB and LDA. LDA seems to produce a reasonable correct output result, but the RBM isn't. Following a suggestion, I binarized the feature inputs using skearn.preprocessing.Binarizer, and also tried different threshold parameter values. I tried several different ways to apply binarization, but none seemed to work for me.
Below is my modified version of the code based on this user's version User: covariance.
Any helpful comments are greatly appreciated.
from sklearn import linear_model, datasets, preprocessing
from sklearn.cross_validation import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.neural_network import BernoulliRBM
from sklearn.lda import LDA
# import some data to play with
iris = datasets.load_iris()
X = iris.data[:,:2] # we only take the first two features.
Y = iris.target
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=10)
# Models we will use
rbm = BernoulliRBM(random_state=0, verbose=True)
binarizer = preprocessing.Binarizer(threshold=0.01,copy=True)
X_binarized = binarizer.fit_transform(X_train)
hidden_layer = rbm.fit_transform(X_binarized, Y_train)
logistic = linear_model.LogisticRegression()
logistic.coef_ = hidden_layer
classifier = Pipeline(steps=[('rbm', rbm), ('logistic', logistic)])
lda = LDA(n_components=3)
#########################################################################
# Training RBM-Logistic Pipeline
logistic.fit(X_train, Y_train)
classifier.fit(X_binarized, Y_train)
#########################################################################
# Get predictions
print "The RBM model:"
print "Predict: ", classifier.predict(X_test)
print "Real: ", Y_test
print
print "Linear Discriminant Analysis: "
lda.fit(X_train, Y_train)
print "Predict: ", lda.predict(X_test)
print "Real: ", Y_test
RBM and LDA are not directly comparable, as RBM doesn't perform classification on its own. Though you are using it as a feature engineering step with logistic regression at the end, LDA is itself a classifier - so the comparison isn't very meaningful.
The BernoulliRBM in scikit learn only handles binary inputs. The iris dataset has no sensible binarization, so you aren't going to get any meaningful outputs.