I am currently running a program in Jupyter notebook to classify MNIST dataset.
I am trying to use the KNN classifier to do this and it is taking more than an hour to run. I am new to classifiers and hyper-parameters and there does not seem to be any decent tutorials on how to properly implement one of them. Could anyone give me some tips on how to use a hyper-parameter for this classification? I have searched and seen GridSearchCv and RandomizedSearchCV. From viewing their examples it seems they select different attribute names and alter to the ones necessary for their code. I do not understand how this can be done for MNIST dataset if the data is just handwritten digits. Seeing that there is only digits could there not be a need for a hyperparameter in this case? This is my code that I am currently still running. Thank you for any help you can provide.
# To support both python 2 and python 3
from __future__ import division, print_function, unicode_literals
# Common imports
import numpy as np
import os
# to make this notebook's output stable across runs
np.random.seed(42)
# To plot pretty figures
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12
# Where to save the figures
PROJECT_ROOT_DIR = "."
CHAPTER_ID = "classification"
def save_fig(fig_id, tight_layout=True):
image_dir = os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID)
if not os.path.exists(image_dir):
os.makedirs(image_dir)
path = os.path.join(image_dir, fig_id + ".png")
print("Saving figure", fig_id)
if tight_layout:
plt.tight_layout()
plt.savefig(path, format='png', dpi=300)
def sort_by_target(mnist):
reorder_train = np.array(sorted([(target, i) for i, target in enumerate(mnist.target[:60000])]))[:, 1]
reorder_test = np.array(sorted([(target, i) for i, target in enumerate(mnist.target[60000:])]))[:, 1]
mnist.data[:60000] = mnist.data[reorder_train]
mnist.target[:60000] = mnist.target[reorder_train]
mnist.data[60000:] = mnist.data[reorder_test + 60000]
mnist.target[60000:] = mnist.target[reorder_test + 60000]
try:
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1, cache=True)
mnist.target = mnist.target.astype(np.int8) # fetch_openml() returns targets as strings
sort_by_target(mnist) # fetch_openml() returns an unsorted dataset
except ImportError:
from sklearn.datasets import fetch_mldata
mnist = fetch_mldata('MNIST original')
mnist["data"], mnist["target"]
mnist.data.shape
X, y = mnist["data"], mnist["target"]
X.shape
y.shape
#select and display some digit from the dataset
import matplotlib
import matplotlib.pyplot as plt
some_digit_index = 7201
some_digit = X[some_digit_index]
some_digit_image = some_digit.reshape(28, 28)
plt.imshow(some_digit_image, cmap = matplotlib.cm.binary,
interpolation="nearest")
plt.axis("off")
save_fig("some_digit_plot")
plt.show()
#print some digit's label
print('The ground truth label for the digit above is: ',y[some_digit_index])
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
#random shuffle
import numpy as np
shuffle_index = np.random.permutation(60000)
X_train, y_train = X_train[shuffle_index], y_train[shuffle_index]
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
y_train_large = (y_train >= 7)
y_train_odd = (y_train % 2 == 1)
y_multilabel = np.c_[y_train_large, y_train_odd]
knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_multilabel)
knn_clf.predict([some_digit])
y_train_knn_pred = cross_val_predict(knn_clf, X_train, y_multilabel, cv=3, n_jobs=-1)
f1_score(y_multilabel, y_train_knn_pred, average="macro")
The most popular hyperparameter for KNN would be n_neighbors, that is, how many nearest neighbors you consider to assign a label to a new point. By default, it is set to 5, but it might not be the best choice. Therefore it is often better to find what the best choice is for your specific problem.
This is how you would find the optimal hyperparameter for your example:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
param_grid = {"n_neighbors" : [3,5,7]}
KNN=KNeighborsClassifier()
grid=GridSearchCV(KNN, param_grid = param_grid , cv = 5, scoring = 'accuracy', return_train_score = False)
grid.fit(X_train,y_train)
What this does is comparing the performance of your KNN model with the different values of n_neighbors that you set. Then when you do:
print(grid.best_score_)
print(grid.best_params_)
it will show you what the best performance score was, and for which choice of parameters it was achieved.
All this has nothing to do with the fact that you are using MNIST data. You could use this approach for any other classification task, as long as you think KNN might be a sensible choice for your task (which might be arguable for image classification). The only thing that would change from one task to another is the optimal value of the hyperparameters.
PS: I would advise not to use the y_multilabel terminology as this might refer to a specific classification task where each data point could have several labels, which is not the case in MNIST (each image represents only one digit at at time).
Related
Paul need a laptop that is fast enough. One of the main parameter of computers which he must focus on is CPU. In this project we need to forecast performance of CPU which is characterized in terms of cycle time and memory capacity and so on.
It is Linear Regression problem and you should predict the Estimated Relative Performance column.
I am new in Python. Could anybody help me with the code for this task?
CSV file (on Google Drive)
This is what I have done. But probably I did not understand the case.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
data = pd.read_csv("Computer_Hardware.csv")
data
data.describe()
y = data["Machine Cycle Time in nanoseconds"]
x1 = data["Estimated Relative Performance"]
plt.scatter(x1,y)
plt.xlabel("Estimated Relative Performance", fontsize = 20)
plt.ylabel("Machine Cycle Time in nanoseconds", fontsize = 20)
plt.show()
x = sm.add_constant(x1)
x = sm.add_constant(x1)
results = sm.OLS(y,x).fit()
results.summary()
In any fitted model from statsmodels you can extract predicted values with method predict() and then add them to your frame.
data['predicted'] = results.predict()
Maybe your model needs more work, for now, it only uses a variable and maybe you will get a better prediction with another model that uses more variables.
y = b0 + b1 * x1
According to the text "... CPU which is characterized in terms of cycle time and memory capacity and so on" is the problem.
A proposal will be to extend your models using statsmodels API to write a formula. In your case I like to remove all spaces in columns names before.
# Rename columns without spaces
old_columns = data.columns
new_columns = [col.replace(' ', '_') for col in old_columns]
data = data.rename(columns={old:new for old, new in zip(old_columns, new_columns)})
# Fit a model using more variables
import statsmodels.formula.api as sm2
formula = ('Estimated_Relative_Performance ~ ',
'Machine_Cycle_Time_in_nanoseconds + ',
'Maximum_Main_Memory_in_kilobytes + ',
'Cache_Memory_in_Kilobytes + ',
'Maximum_Channels_in_Units')
formula = ' '.join(formula)
print(formula)
results2 = sm2.ols(formula, data).fit()
results2.summary()
data['predicted2'] = results2.predict()
Keras is one of the easiest libraries to use if you don't have much experience in Python. This is a good tutorial: https://machinelearningmastery.com/tutorial-first-neural-network-python-keras/
I studied some blogs regarding the topic, and came with this code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
raw_data = pd.read_csv("Computer_Hardware.csv")
x = raw_data[['Machine Cycle Time in nanoseconds',
'Minimum Main Memory in Kilobytes', 'Maximum Main Memory in kilobytes',
'Cache Memory in Kilobytes', 'Minimum Channels in Units',
'Maximum Channels in Units', 'Published Relative Performance']]
y = raw_data['Estimated Relative Performance']
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3)
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(x_train, y_train)
print(model.coef_)
print(model.intercept_)
pd.DataFrame(model.coef_, x.columns, columns = ['Coeff'])
predictions = model.predict(x_test)
plt.hist(y_test - predictions)
from sklearn import metrics
metrics.mean_absolute_error(y_test, predictions)
metrics.mean_squared_error(y_test, predictions)
np.sqrt(metrics.mean_squared_error(y_test, predictions))
I am trying to detect outlier images. But I'm getting bizarre results from the model.
I've read in the images with cv2, flattened them into 1d-arrays, and turned them into a pandas dataframe and then fed that into the SVM.
import numpy as np
import cv2
import glob
import pandas as pd
import sys, os
import matplotlib.pyplot as plt
from sklearn import svm
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn import *
import seaborn as sns`
load the labels and files
labels_wt = np.loadtxt("labels_wt.txt", delimiter="\t", dtype="str")
files_wt = np.loadtxt("files_wt.txt", delimiter="\t", dtype="str")`
load and flatten the images
wt_images_tmp = [cv2.imread(file) for file in files_wt]
wt_images = [image.flatten() for image in wt_images_tmp]
tmp3 = np.array(wt_images)
mutant_images_tmp = [cv2.imread(file) for file in files_mut]
mutant_images = [image.flatten() for image in mutant_images_tmp]
tmp4 = np.array(mutant_images)
X = pd.DataFrame(tmp3) #load the wild-type images
y = pd.Series(labels_wt)
X_train, X_test, y_train, y_test= train_test_split(X, y, test_size=0.2, random_state=42)
X_outliers = pd.DataFrame(tmp4)
clf = svm.OneClassSVM(nu=0.15, kernel="rbf", gamma=0.0001)
clf.fit(X_train)
Then I evaluate the results according to the sklearn tutorial on oneclass SVM.
y_pred_train = clf.predict(X_train)
y_pred_test = clf.predict(X_test)
y_pred_outliers = clf.predict(X_outliers)
n_error_train = y_pred_train[y_pred_train == -1].size
n_error_test = y_pred_test[y_pred_test == -1].size
n_error_outliers = y_pred_outliers[y_pred_outliers == 1].size
print(n_error_train / len(y_pred_train))
print(float(n_error_test) / float(len(y_pred_test)))
print(n_error_outliers / len(y_pred_outliers))`
my error rates on the training set have been variable (10-30%), but on the test set, they have never gone below 100%. Am I doing this wrong?
My guess is that you are setting random_state = 42, this is biasing your train_test_split to always have the same splitting pattern. You can read more about it in this answer. Don't specify any state and run the code again, so:
X_train, X_test, y_train, y_test= train_test_split(X, y, test_size=0.2)
This will show different results. Once you are sure this works, make sure yo then do cross-validation, possibly using k-fold validation. Let us know if this helps.
I have a multi-class classification problem with 9 different classes. I am using the AdaBoostClassifier class from scikit-learn to train my model without using the one vs all technique, as the number of classes is very high and it might be inefficient.
I have tried using the tips from the documentation in scikit learn [1], but there the one vs all technique is used, which is substantially different. In my approach I only get one prediction per event, i.e. if I have n classes, the outcome of the prediction is a single value within the n classes. For the one vs all approach, on the other hand, the outcome of the prediction is an array of size n with a sort of likelihood value per class.
[1]
https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html#sphx-glr-auto-examples-model-selection-plot-roc-py
The code is:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt # Matplotlib plotting library for basic visualisation
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_curve, auc
from sklearn import preprocessing
# Read data
df = pd.read_pickle('data.pkl')
# Create the dependent variable class
# This will substitute each of the n classes from
# text to number
factor = pd.factorize(df['target_var'])
df.target_var= factor[0]
definitions = factor[1]
X = df.drop('target_var', axis=1)
y = df['target_var]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)
bdt_clf = AdaBoostClassifier(
DecisionTreeClassifier(max_depth=2),
n_estimators=250,
learning_rate=0.3)
bdt_clf.fit(X_train, y_train)
y_pred = bdt_clf.predict(X_test)
#Reverse factorize (converting y_pred from 0s,1s, 2s, etc. to their original values
reversefactor = dict(zip(range(9),definitions))
y_test_rev = np.vectorize(reversefactor.get)(y_test)
y_pred_rev = np.vectorize(reversefactor.get)(y_pred)
I tried directly with the roc curve function, and also binarising the labels, but I always get the same error message.
def multiclass_roc_auc(y_test, y_pred):
lb = preprocessing.LabelBinarizer()
lb.fit(y_test)
y_test = lb.transform(y_test)
y_pred = lb.transform(y_pred)
return roc_curve(y_test, y_pred)
multiclass_roc_auc(y_test, y_pred_test)
The error message is:
ValueError: multilabel-indicator format is not supported
How could this be sorted out? Am I missing some important concept?
I have the following code so far:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from sklearn import preprocessing
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
df_train = pd.read_csv('uc_data_train.csv')
del df_train['Unnamed: 0']
temp = df_train['size_womenswear']
del df_train['size_womenswear']
df_train['size_womenswear'] = temp
df_train['count'] = 1
print(df_train.head())
print(df_train.dtypes)
print(df_train[['size_womenswear', 'count']].groupby('size_womenswear').count()) # Determine number of unique catagories, and number of cases for each catagory
del df_train['count']
df_test = pd.read_csv('uc_data_test.csv')
del df_test['Unnamed: 0']
print(df_test.head())
print(df_test.dtypes)
df_train.drop(['customer_id','socioeconomic_status','brand','socioeconomic_desc','order_method',
'first_order_channel','days_since_first_order','total_number_of_orders', 'return_rate'], axis=1, inplace=True)
LE = preprocessing.LabelEncoder() # Create label encoder
df_train['size_womenswear'] = LE.fit_transform(np.ravel(df_train[['size_womenswear']]))
print(df_train.head())
print(df_train.dtypes)
x = df_train.iloc[:,np.arange(len(df_train.columns)-1)].values # Assign independent values
y = df_train.iloc[:,-1].values # and dependent values
xTrain, xTest, yTrain, yTest = train_test_split(x, y, test_size = 0.25, random_state = 0) # Testing on 75% of the data
model = GaussianNB()
model.fit(xTrain, yTrain)
yPredicted = model.predict(xTest)
#print(yPrediction)
print('Accuracy: ', accuracy_score(yTest, yPredicted))
I am not sure how to include the data that I am using but I am trying to predict the 'size_womenswear'. There are 8 different sizes that I have encoded to predict and I have moved this column to the end of the dataframe. so y is the dependent and x are the independent (all the other columns)
I am using a Gaussian Naive Bayes classifier to try and classify the 8 different sizes and then test on 25% of the data. The results are not very good.
I don't know why I am only getting an accuracy of 61% when I am working with 80,000 rows. I am very new to Machine Learning and would appreciate any assistance. Is there a better method that I could use in this case than Gaussian Naive Bayes?
can't comment, just throwing out some ideas;
Maybe you need to deal with class imbalance, and try other model that will fit the data better? try the xgboost or lightgbm package given good data they usually perform pretty good in general, but it really depends on the data.
Also the way you split train and test, does the resulting train and test data set has similar distribution for your Y? that's very important.
Last thing, for classification models the performance measurement can be a bit tricky, try some other measurement methods. F1 scores or try to draw a confusion matrix and see what your predictions vs Y looks like. perhaps your model is predicting everything to one
or just a few classes.
For the same dataset (here Bupa) and parameters i get different accuracies.
What did I overlook?
R implementation:
data_file = "bupa.data"
dataset = read.csv(data_file, header = FALSE)
nobs <- nrow(dataset) # 303 observations
sample <- train <- sample(nrow(dataset), 0.95*nobs) # 227 observations
# validate <- sample(setdiff(seq_len(nrow(dataset)), train), 0.1*nobs) # 30 observations
test <- setdiff(seq_len(nrow(dataset)), train) # 76 observations
svmfit <- svm(V7~ .,data=dataset[train,],
type="C-classification",
kernel="linear",
cost=1,
cross=10)
testpr <- predict(svmfit, newdata=na.omit(dataset[test,]))
accuracy <- sum(testpr==na.omit(dataset[test,])$V7)/length(na.omit(dataset[test,])$V7)
I get accuracy: 0.94
but when i do as following in python (scikit-learn)
import numpy as np
from sklearn import cross_validation
from sklearn import datasets
import pandas as pd
from sklearn import svm, grid_search
f = open("data/bupa.data")
dataset = np.loadtxt(fname = f, delimiter = ',')
nobs = np.shape(dataset)[0]
print("Number of Observations: %d" % nobs)
y = dataset[:,6]
X = dataset[:,:-1]
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.06, random_state=0)
clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
scores = cross_validation.cross_val_score(clf, X, y, cv=10, scoring='accuracy')
I get accuracy 0.67
please help me.
I came across this post having the same issue - wildly different accuracy between scikit-learn and e1071 bindings for libSVM. I think the issue is that e1071 scales the training data and then keeps the scaling parameters for using in predicting new observations. Scikit-learn does not do this and leaves it up the user to realize that the same scaling approach needs to be taken on both training and test data. I only thought to check this after encountering and reading this guide from the nice people behind libSVM.
While I don't have your data, str(svmfit) should give you the scaling params (mean and standard deviation of the columns of Bupa). You can use these to appropriately scale your data in Python (see below for an idea). Alternately, you can scale the entire dataset together in Python and then do test/train splits; either way should give you now identical predictions.
def manual_scale(a, means, sds):
a1 = a - means
a1 = a1/sds
return a1
When using Support Vector Regression in Python/sklearn and R/e1071 both x and y variables need to be scaled/unscaled.
Here is a self-contained example using rpy2 to show equivalence of R and Python results (first part with disabled scaling in R, second part with 'manual' scaling in Python):
# import modules
import matplotlib.pyplot as plt
import numpy as np
import sklearn
import sklearn.model_selection
import sklearn.datasets
import sklearn.svm
import rpy2
import rpy2.robjects
import rpy2.robjects.packages
# use R e1071 SVM function via rpy2
def RSVR(x_train, y_train, x_test,
cost=1.0, epsilon=0.1, gamma=0.01, scale=False):
# convert Python arrays to R matrices
rx_train = rpy2.robjects.r['matrix'](rpy2.robjects.FloatVector(np.array(x_train).T.flatten()), nrow = len(x_train))
ry_train = rpy2.robjects.FloatVector(np.array(y_train).flatten())
rx_test = rpy2.robjects.r['matrix'](rpy2.robjects.FloatVector(np.array(x_test).T.flatten()), nrow = len(x_test))
# train SVM
e1071 = rpy2.robjects.packages.importr('e1071')
rsvr = e1071.svm(x=rx_train,
y=ry_train,
kernel='radial',
cost=cost,
epsilon=epsilon,
gamma=gamma,
scale=scale)
# run SVM
predict = rpy2.robjects.r['predict']
ry_pred = np.array(predict(rsvr, rx_test))
return ry_pred
# define auxiliary function for plotting results
def plot_results(y_test, py_pred, ry_pred, title, lim=[-500, 500]):
plt.title(title)
plt.plot(lim, lim, lw=2, color='gray', zorder=-1)
plt.scatter(y_test, py_pred, color='black', s=40, label='Python/sklearn')
plt.scatter(y_test, ry_pred, color='orange', s=10, label='R/e1071')
plt.xlabel('observed')
plt.ylabel('predicted')
plt.legend(loc=0)
return None
# get example regression data
x_orig, y_orig = sklearn.datasets.make_regression(n_samples=100, n_features=10, random_state=42)
# split into train and test set
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(x_orig, y_orig, train_size=0.8)
# SVM parameters
# (identical but named differently for R/e1071 and Python/sklearn)
C = 1000.0
epsilon = 0.1
gamma = 0.01
# setup SVM and scaling classes
psvr = sklearn.svm.SVR(kernel='rbf', C=C, epsilon=epsilon, gamma=gamma)
x_sca = sklearn.preprocessing.StandardScaler()
y_sca = sklearn.preprocessing.StandardScaler()
# run R and Python SVMs without any scaling
# (see 'scale=False')
py_pred = psvr.fit(x_train, y_train).predict(x_test)
ry_pred = RSVR(x_train, y_train, x_test,
cost=C, epsilon=epsilon, gamma=gamma, scale=False)
# scale both x and y variables
sx_train = x_sca.fit_transform(x_train)
sy_train = y_sca.fit_transform(y_train.reshape(-1, 1))[:, 0]
sx_test = x_sca.transform(x_test)
sy_test = y_sca.transform(y_test.reshape(-1, 1))[:, 0]
# run Python SVM on scaled data and invert scaling afterwards
ps_pred = psvr.fit(sx_train, sy_train).predict(sx_test)
ps_pred = y_sca.inverse_transform(ps_pred.reshape(-1, 1))[:, 0]
# run R SVM with native scaling on original/unscaled data
# (see 'scale=True')
rs_pred = RSVR(x_train, y_train, x_test,
cost=C, epsilon=epsilon, gamma=gamma, scale=True)
# plot results
plt.subplot(121)
plot_results(y_test, py_pred, ry_pred, 'without scaling (Python/sklearn default)')
plt.subplot(122)
plot_results(y_test, ps_pred, rs_pred, 'with scaling (R/e1071 default)')
plt.tight_layout()
UPDATE: Actually, the scaling uses a slightly different definition of variance in R and Python, see this answer (1/(N-1)... in R vs. 1/N... in Python where N is the sample size). However, for typical sample sizes, this should be negligible.
I can confirm these statements. One indeed needs to apply the same scaling to the train and test sets. In particular I have done this:
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X = sc_X.fit_transform(X)
where X is my training set. Then, when preparing the test set, I have simply used the StandardScaler instance obtained from the scaling of the training test. It is important to used it just for transforming, not for fitting and transforming (like above), i.e.:
X_test = sc_X.transform(X_test)
This allowed on obtaining substantial agreement between R and scikit-learn results.