How to fix NameError: name 'X_train' is not defined? - python

I am running the [code] for multi-label classification. How do I fix the NameError saying that "X_train" is not defined? The Python code is given below.
import scipy
from scipy.io import arff
data, meta = scipy.io.arff.loadarff('./yeast/yeast-train.arff')
from sklearn.datasets import make_multilabel_classification
# this will generate a random multi-label dataset
X, y = make_multilabel_classification(sparse = True, n_labels = 20,
return_indicator = 'sparse', allow_unlabeled = False)
# using binary relevance
from skmultilearn.problem_transform import BinaryRelevance
from sklearn.naive_bayes import GaussianNB
# initialize binary relevance multi-label classifier
# with a gaussian naive bayes base classifier
classifier = BinaryRelevance(GaussianNB())
# train
classifier.fit(X_train, y_train)
# predict
predictions = classifier.predict(X_test)
from sklearn.metrics import accuracy_score
accuracy_score(y_test,predictions)

You forgot to split the dataset into train and test sets.
Import the library
from sklearn.model_selection import train_test_split
Add this line before classifier.fit()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
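For reference, here is a minimal end-to-end sketch of the split, fit, predict order. It is not the asker's exact setup: to keep it runnable with scikit-learn alone, it assumes a dense dataset and swaps in OneVsRestClassifier as a stand-in for skmultilearn's BinaryRelevance (both train one binary classifier per label). The sample size and number of labels are arbitrary.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import GaussianNB

# Random multi-label dataset (dense X, binary indicator matrix y); sizes are illustrative
X, y = make_multilabel_classification(
    n_samples=200, n_classes=5, allow_unlabeled=False, random_state=0
)

# The split is what defines X_train, X_test, y_train and y_test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

# One Gaussian naive Bayes model per label (stand-in for BinaryRelevance)
classifier = OneVsRestClassifier(GaussianNB())
classifier.fit(X_train, y_train)

predictions = classifier.predict(X_test)

# For multi-label targets this is subset accuracy: every label of a sample must match
score = accuracy_score(y_test, predictions)
print(score)
```

The only point that matters for the NameError is the order: train_test_split has to run before classifier.fit().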

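X_train does not exist; you have to split the data into train and test sets first. After that you can optionally scale the features:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
# with_mean=False because X is a sparse matrix in the question
s = StandardScaler(with_mean=False)
X_train = s.fit_transform(X_train)
# reuse the statistics learned on the training set; do not refit on the test set
X_test = s.transform(X_test)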
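A note on the scaling step: the scaler is normally fitted on the training split only and then applied unchanged to the test split, so that no test-set statistics leak into training. A small sketch on made-up dense data (the array shapes and distribution are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic dense features and binary labels, purely for illustration
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y = rng.integers(0, 2, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean and std from the training data
X_test_scaled = scaler.transform(X_test)        # applies those same statistics to the test data

# The stored mean is the training mean, not the mean of the full dataset
print(scaler.mean_)
print(X_train.mean(axis=0))
```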

Related

How can I create an instance of a multi-layer perceptron network to use in a bagging classifier?

I am trying to create an instance of a multi-layer perceptron network to use in a bagging classifier, but I don't understand how to get it working.
Here is my code:
My task is:
1. Apply a bagging classifier (with or without replacement) with the eight base classifiers created in the previous step.
It would be really great if you could show me how to implement this in my algorithm. I searched but couldn't find a way to do it.
To train your BaggingClassifier:
import numpy as np  # needed for the confusion-matrix indexing further down
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import classification_report, confusion_matrix
#Load the digits data:
X,y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3, random_state=0)
# Feature scaling
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
# Finally for the MLP- Multilayer Perceptron
mlp = MLPClassifier(hidden_layer_sizes=(16, 8, 4, 2), max_iter=1001)
clf = BaggingClassifier(mlp, n_estimators=8)
clf.fit(X_train,y_train)
To analyze your output you may try:
y_pred = clf.predict(X_test)
cm = confusion_matrix(y_test, y_pred, labels=clf.classes_)
print(cm)
To see the number of correctly predicted instances per class:
print(cm[np.eye(len(clf.classes_)).astype("bool")])
To see percentage of correctly predicted instances per class:
cm[np.eye(len(clf.classes_)).astype("bool")]/cm.sum(1)
To see the total accuracy of your algorithm:
(y_pred==y_test).mean()
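The boolean np.eye mask above just picks out the diagonal of the confusion matrix, so np.diag gives the same numbers. A tiny worked example with hand-made labels (the six labels below are invented for illustration, not taken from the digits data):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hand-made true and predicted labels for three classes
y_test = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred, labels=[0, 1, 2])

correct_per_class = np.diag(cm)                        # correctly predicted count per class
recall_per_class = correct_per_class / cm.sum(axis=1)  # fraction correct per true class
accuracy = correct_per_class.sum() / cm.sum()          # overall accuracy

print(cm)
print(correct_per_class)  # [1 2 1]
print(recall_per_class)   # [0.5 1.  0.5]
```

Here 4 of the 6 samples lie on the diagonal, so the overall accuracy is 4/6, the same value that (y_pred==y_test).mean() returns.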
EDIT
To access predictions on a per base estimator basis, i.e. your mlps, you can do:
estimators = clf.estimators_
# print(len(estimators), type(estimators[0]))
preds = []
for base_estimator in estimators:
    preds.append(base_estimator.predict(X_test))
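Once you have the per-estimator predictions, you can stack them and score each base model separately. The sketch below is an assumption-laden illustration: it uses shallow decision trees on the iris data instead of MLPs so that it runs quickly (MLPClassifier can be swapped in), and it relies on the default max_features=1.0 so that every base estimator sees all columns. BaggingClassifier trains its base estimators on label indices, so the predictions are mapped back through clf.classes_.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Eight base estimators; decision trees stand in for the MLPs of the answer
clf = BaggingClassifier(DecisionTreeClassifier(max_depth=2), n_estimators=8, random_state=0)
clf.fit(X_train, y_train)

# One row of predictions per base estimator, mapped from label indices to class labels
preds = np.stack(
    [clf.classes_[est.predict(X_test).astype(int)] for est in clf.estimators_]
)

# Accuracy of each individual base estimator on the test set
per_estimator_accuracy = (preds == y_test).mean(axis=1)
print(preds.shape)
print(per_estimator_accuracy)
```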

How to feed data into random forest classifier and see prediction

I have built a random forest classifier using scikit-learn and Python, and I am having trouble actually feeding data in to see the prediction. I want to see the format of the output and convert it to a JSON file. I have attached the code for the random forest and what the data looks like. I believe I need to use 'y_pred', but I am not sure what format the input data needs to be in.
X = dataset.iloc[:, 2:4].values
y = dataset["pages"]
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=0)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=20, random_state = 0)
classifier = classifier.fit(X_train,y_train)
y_pred = classifier.predict(X_test)
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
You can simply concatenate the predicted values with the matrix of features.
Also note that the pipeline is exactly for this purpose, when you first want to transform the data and then apply some classifier.
This should work for you:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
X = dataset.iloc[:, 2:4].values
y = dataset["pages"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
classifier = make_pipeline(StandardScaler(), RandomForestClassifier(n_estimators=20, random_state=0))
classifier = classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
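# X_test is a NumPy array here (because of .values), so wrap it in a DataFrame before concatenating
pred = pd.concat([pd.DataFrame(X_test), pd.Series(y_pred, name="pages")], axis=1)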
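Since the question also asks about JSON, here is a sketch of one way to get there. The dataset is made up (the column names other than "pages" are hypothetical), and the features are kept as a DataFrame instead of calling .values so that the column names survive into the output:

```python
import json

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up stand-in for the asker's dataset; only "pages" comes from the question
rng = np.random.default_rng(0)
dataset = pd.DataFrame({
    "id": np.arange(100),
    "name": [f"doc{i}" for i in range(100)],
    "width": rng.normal(size=100),
    "height": rng.normal(size=100),
    "pages": rng.integers(1, 4, size=100),
})

X = dataset.iloc[:, 2:4]   # the two feature columns, kept as a DataFrame
y = dataset["pages"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

classifier = make_pipeline(StandardScaler(), RandomForestClassifier(n_estimators=20, random_state=0))
classifier.fit(X_train, y_train)

# predict() takes the same two feature columns and returns one label per row
pred = X_test.assign(pages_pred=classifier.predict(X_test))

# One JSON object per test row
json_text = pred.to_json(orient="records")
records = json.loads(json_text)
print(records[0])
```

To write a file instead of a string, pass a path to to_json, e.g. pred.to_json("predictions.json", orient="records").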

I am getting Not Fitted error in random forest classifier?

I have 4 features and one target variable. I am using RandomForestRegressor instead of RandomForestClassifier because my target variable is a float. When I fit my model and then try to output the feature importances in sorted order, I get a Not Fitted error. How do I fix it?
Code:
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn import datasets
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score
# Split the data into 30% test and 70% training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
feat_labels = data.columns[:4]
regr = RandomForestRegressor(max_depth=2, random_state=0)
#clf = RandomForestClassifier(n_estimators=100, random_state=0)
# Train the classifier
#clf.fit(X_train, y_train)
regr.fit(X, y)
importances = clf.feature_importances_
indices = np.argsort(importances)[::-1]
for f in range(X_train.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 30, feat_labels[indices[f]], importances[indices[f]]))
You are fitting to regr but calling the feature importances on clf. Try calling this instead:
importances = regr.feature_importances_
I noticed that your classifier was previously being fit with the training data you set up, but the regressor is now being fit with X and y.
However, I don't see where you set X and y in the first place, or where you actually load a dataset. Could it be that you forgot this step, in addition to what Harpal mentioned in another answer?
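Putting both points together, a corrected version might look like the sketch below. The data is synthetic (make_regression with 4 features) and the feature names are placeholders, since the original dataset is not shown:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 4 features and a continuous target
X, y = make_regression(n_samples=200, n_features=4, noise=0.1, random_state=0)
feat_labels = ["f0", "f1", "f2", "f3"]  # placeholder names

# Split the data into 30% test and 70% training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

regr = RandomForestRegressor(max_depth=2, random_state=0)
regr.fit(X_train, y_train)

# Read the importances from the object that was actually fitted
importances = regr.feature_importances_
indices = np.argsort(importances)[::-1]  # most important feature first

for rank, idx in enumerate(indices, start=1):
    print("%2d) %-*s %f" % (rank, 30, feat_labels[idx], importances[idx]))
```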

ValueError: Unknown label type: 'continuous', SVC Sklearn

I am trying to use the SVC class from sklearn.
However, I get this error when I run my program:
ValueError: Unknown label type: 'continuous'
I do not know whether sklearn has a regressor version of the support vector machine; SVC is the only one I have found so far. Here is my code:
import sklearn
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import numpy as np
from sklearn.svm import SVC
X, Y = get_data(filename)
X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state=33)
svc = SVC()
svc.fit(X_train, y_train)
print(svc.score(X_train, y_train))
print(svc.score(X_test, y_test))
Thanks.
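SVC is a classifier, so it does not support continuous values in the target. What you need is SVR, the regression counterpart. Just replace all occurrences of SVC with SVR and you are good to go. Note that score() on a regressor returns R^2 rather than accuracy.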
from sklearn.svm import SVR
svr = SVR()
svr.fit(X_train, y_train)
print(svr.score(X_train, y_train))
print(svr.score(X_test, y_test))
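If you are unsure which kind of target you have, scikit-learn's type_of_target helper reports how it will be interpreted. A sketch on synthetic data (the sine-shaped data is an assumption, since get_data is not shown):

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.utils.multiclass import type_of_target

# Synthetic regression data: a noisy sine curve
rng = np.random.default_rng(33)
X = rng.uniform(-3, 3, size=(200, 1))
Y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

# Non-integer floats are reported as 'continuous', which classifiers such as SVC reject
target_kind = type_of_target(Y)
print(target_kind)

X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state=33)
svr = SVR()
svr.fit(X_train, y_train)

# For regressors, score() is the coefficient of determination R^2
test_r2 = svr.score(X_test, y_test)
same_r2 = r2_score(y_test, svr.predict(X_test))
print(test_r2, same_r2)
```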

ValueError unknown label type array sklearn- load_boston

I am using the following code to check SGDClassifier
import numpy as np
from sklearn.datasets import load_boston
from sklearn.linear_model import SGDClassifier
from sklearn.cross_validation import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.cross_validation import train_test_split
data = load_boston()
x_train, x_test, y_train, y_test = train_test_split(data.data, data.target)
x_scalar = StandardScaler()
y_scalar = StandardScaler()
x_train = x_scalar.fit_transform(x_train)
y_train = y_scalar.fit_transform(y_train)
x_test = x_scalar.transform(x_test)
y_test = y_scalar.transform(y_test)
regressor = SGDClassifier(loss='squared_loss')
scores = cross_val_score(regressor, x_train, y_train, cv=5)
print 'cross validation r scores ', scores
print 'average score ', np.mean(scores)
regressor.fit_transform(x_train, y_train)
print 'test set r score ', regressor.score(x_test,y_test)
However, when I run it I get deprecation warnings about reshape and the following ValueError:
ValueError Traceback (most recent call last)
<ipython-input-55-4d64d112f5db> in <module>()
18
19 regressor = SGDClassifier(loss='squared_loss')
---> 20 scores = cross_val_score(regressor, x_train, y_train, cv=5)
ValueError: Unknown label type: (array([ -1.89568750e+00, -1.75715217e+00, -1.68255622e+00,
-1.66124309e+00, -1.62927339e+00, -1.54402088e+00,
-1.49073806e+00, -1.41614211e+00, -1.40548554e+00,
-1.34154616e+00, -1.32023303e+00, -1.30957647e+00,
-1.27760677e+00, -1.26695021e+00, -1.25629365e+00,
-1.20301082e+00, -1.17104113e+00, -1.16038457e+00,....]),)
What is the probable error in the code?
In classification tasks, the dependent variable (or the target) is categorical. We try to predict if a claim is fraudulent or not, for example. In regression, on the other hand, the dependent variable is numerical. It can be measured.
In the Boston Housing dataset, the dependent variable is "Median value of owner-occupied homes in $1000's" (You can see the description by executing print(data.DESCR)). It is a continuous variable and cannot be predicted with a classifier.
If you want to test the classifier, you can use another dataset. For example, change load_boston() to load_iris(). Note that you also need to remove the transformation for the target variable - it is for numerical variables. With these modifications, it should work correctly.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import SGDClassifier
# sklearn.cross_validation was removed in scikit-learn 0.20; these functions now live in model_selection
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler
data = load_iris()
x_train, x_test, y_train, y_test = train_test_split(data.data, data.target)
x_scalar = StandardScaler()
x_train = x_scalar.fit_transform(x_train)
x_test = x_scalar.transform(x_test)
# 'squared_loss' was renamed to 'squared_error' in scikit-learn 1.0; use the old name on older versions
classifier = SGDClassifier(loss='squared_error')
scores = cross_val_score(classifier, x_train, y_train, cv=5)
scores
Out: array([ 0.33333333, 0.2173913 , 0.31818182, 0. , 0.19047619])
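The other way around also works: keep a continuous target and swap the model for a regressor. The sketch below assumes SGDRegressor and synthetic data from make_regression (load_boston is no longer shipped with recent scikit-learn releases). It also reshapes the 1-D target to a column before scaling, which is what the reshape warnings in the question were about.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic continuous-target data standing in for the housing dataset
X, y = make_regression(n_samples=300, n_features=5, noise=5.0, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)

x_scalar = StandardScaler()
y_scalar = StandardScaler()
x_train = x_scalar.fit_transform(x_train)
x_test = x_scalar.transform(x_test)

# StandardScaler expects 2-D input, so reshape the target to a column and flatten it back
y_train = y_scalar.fit_transform(y_train.reshape(-1, 1)).ravel()
y_test = y_scalar.transform(y_test.reshape(-1, 1)).ravel()

# A regressor accepts continuous targets; its default loss is the squared error
regressor = SGDRegressor(random_state=0)
scores = cross_val_score(regressor, x_train, y_train, cv=5)
print('cross validation r2 scores', scores)
print('average score', np.mean(scores))

regressor.fit(x_train, y_train)
test_r2 = regressor.score(x_test, y_test)
print('test set r2 score', test_r2)
```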
