Although I have used StandardScaler or MinMaxScaler to preprocess my data while solving a problem with MLPRegressor in sklearn, the predicted values contain many negative numbers, even though the training targets are all positive. The data is here:
https://drive.google.com/open?id=1JF_EpyiMF5WzKZOt6d0iA2174eAheaTW
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.neural_network import MLPRegressor
x_train, x_test, y_train, y_test = train_test_split(x,y)
min_max_scaler = MinMaxScaler()  # matches the import above; 'preprocessing' was never imported
x_train = min_max_scaler.fit_transform(x_train)
x_test = min_max_scaler.transform(x_test)
mlp = MLPRegressor(activation='logistic' , solver='sgd' ,verbose=10, hidden_layer_sizes=(10,10), max_iter=1000)
mlp.fit(x_train, y_train)
print("Training set score :%f" % mlp.score(x_train, y_train))
print("Test score :%f" % mlp.score(x_test, y_test))
predictions = mlp.predict(x_test)
Any suggestions as to where the problem is?
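A likely explanation: MLPRegressor uses an identity (linear) output activation, so nothing constrains its predictions to be positive, no matter how the inputs are scaled. One common workaround is to fit on log-transformed targets so the inverse transform can only yield positive values. A minimal sketch, assuming every target is strictly positive (variable names reuse the question's):
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.neural_network import MLPRegressor
# Fit on log(y) and invert predictions with exp(); exp() is always positive,
# so the final predictions cannot be negative. Requires y > 0 everywhere.
positive_mlp = TransformedTargetRegressor(
    regressor=MLPRegressor(activation='logistic', solver='sgd',
                           hidden_layer_sizes=(10, 10), max_iter=1000),
    func=np.log,
    inverse_func=np.exp)
positive_mlp.fit(x_train, y_train)
predictions = positive_mlp.predict(x_test)  # strictly positive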
The performance of the model does not improve across training epochs when the rows are sorted by a specific key. The dataset is balanced and has 40,000 records with binary labels (0, 1).
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
Linear_SVC_classifier = SVC(kernel='linear', random_state=1)  # supervised learning
Linear_SVC_classifier.fit(x_train, y_train)
SVC_Prediction = Linear_SVC_classifier.predict(x_test)  # this line was missing; SVC_Prediction was undefined
SVC_Accuracy = accuracy_score(y_test, SVC_Prediction)
print("\n\n\nLinear SVM Accuracy: ", SVC_Accuracy)
Add a CountVectorizer to your training data and use a logistic regression model:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
cv = CountVectorizer()
ctmTr = cv.fit_transform(X_train)
X_test_dtm = cv.transform(X_test)
model = LogisticRegression()
model.fit(ctmTr, y_train)
y_pred_class = model.predict(X_test_dtm)
LR_Accuracy = accuracy_score(y_test, y_pred_class)  # accuracy_score needs both y_true and y_pred
print("\n\n\nLogistic Regression Accuracy: ", LR_Accuracy)
The model definition above is roughly 'equivalent' to these statements:
Linear_SVC_classifier = SVC(kernel='linear', random_state=1)
Linear_SVC_classifier.fit(ctmTr, y_train)
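If you want to keep the vectorizer and the classifier together, a Pipeline does the same thing in one object. A sketch (LogisticRegression could be swapped for the linear SVC):
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
# The pipeline fits the vectorizer on the raw training text and reuses its
# vocabulary at predict time, so no manual fit_transform/transform is needed.
text_clf = make_pipeline(CountVectorizer(), LogisticRegression())
text_clf.fit(X_train, y_train)
accuracy = text_clf.score(X_test, y_test)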
Here is my code, and it always returns 100% accuracy, regardless of how big the test size is. I used the train_test_split method, so I do not believe there should be any duplicated data between the splits. Could someone inspect my code?
from sklearn.tree import DecisionTreeClassifier
import numpy as np
import pandas as pd  # needed for read_csv below
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
data = pd.read_csv('housing.csv')
prices = data['median_house_value']
features = data.drop(['median_house_value', 'ocean_proximity'], axis = 1)
prices.shape    # (20640,)
features.shape  # (20640, 8)
X_train, X_test, y_train, y_test = train_test_split(features, prices, test_size=0.2, random_state=42)
X_train = X_train.dropna()
y_train = y_train.dropna()
X_test = X_test.dropna()
y_test = X_test.dropna()
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
y_train.shape  # (16512,)
X_train.shape  # (16512, 8)
predictions = model.predict(X_test)
score = model.score(y_test, predictions)
score
EDIT: I have reworked my answer since I found multiple issues. Please copy-paste the code below to make sure no bugs are left.
Issues:
You are using DecisionTreeClassifier instead of DecisionTreeRegressor for a regression problem.
You are removing NaNs after doing the train-test split, which breaks the alignment between features and targets (note that y_test is even assigned X_test.dropna()). Do data.dropna() before the split.
You are calling model.score() incorrectly: it expects (X_test, y_test), not (y_test, predictions). Also, accuracy_score is only defined for classification; for a regressor, use model.score(X_test, y_test) (the R^2 score) or r2_score(y_test, predictions).
from sklearn.tree import DecisionTreeRegressor #<---- FIRST ISSUE
import numpy as np
import pandas as pd  # needed for read_csv
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score  # accuracy_score only applies to classification
data = pd.read_csv('housing.csv')
data = data.dropna() #<--- SECOND ISSUE
prices = data['median_house_value']
features = data.drop(['median_house_value', 'ocean_proximity'], axis = 1)
X_train, X_test, y_train, y_test = train_test_split(features, prices, test_size=0.2, random_state=42)
model = DecisionTreeRegressor()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
score = r2_score(y_test, predictions) # <----- THIRD ISSUE: accuracy_score would raise on continuous targets
score
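As a usage note: for regression there is no "accuracy". A short sketch of the common alternatives, all from sklearn.metrics, applied to the predictions above:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
# R^2 is also what model.score(X_test, y_test) returns for regressors;
# MAE is in the target's own units (dollars here), MSE in squared units.
print("R^2:", r2_score(y_test, predictions))
print("MAE:", mean_absolute_error(y_test, predictions))
print("MSE:", mean_squared_error(y_test, predictions))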
I have built a random forest classifier using scikit-learn and Python, and I am having trouble actually feeding data in to see the predictions. I want to see the format of the output and convert it to a JSON file. I have attached the code for the random forest and what the data looks like. I believe I need to use y_pred, but I am not sure what format the input data needs to be.
X = dataset.iloc[:, 2:4].values
y = dataset["pages"]
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=0)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=20, random_state = 0)
classifier = classifier.fit(X_train,y_train)
y_pred = classifier.predict(X_test)
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
You can simply concatenate the predicted values with the matrix of features.
Also note that a Pipeline exists exactly for this purpose: first transform the data, then apply a classifier.
This should work for you:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
X = dataset.iloc[:, 2:4]  # keep a DataFrame (no .values) so it can be concatenated later
y = dataset["pages"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
classifier = make_pipeline(StandardScaler(), RandomForestClassifier(n_estimators=20, random_state=0))
classifier = classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
pred = pd.concat([X_test.reset_index(drop=True), pd.Series(y_pred, name="pages")], axis=1)  # reset the index so the rows line up with the fresh Series
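Since the question asks for a JSON file, pandas can serialize the result directly (the file name is illustrative):
# One JSON object per row, e.g. [{"feature_a": ..., "feature_b": ..., "pages": ...}, ...]
pred.to_json("predictions.json", orient="records")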
I am trying to write the code for k-NN. Below is my code. I know the issue is in predict(), but I am not able to figure out how to fix it.
# Importing the libraries
import numpy as np
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('UniversalBank.csv')
X = dataset.iloc[:, [1,2,3,5,6,7,8,10,11,12,13]].values
y = dataset.iloc[:,9].values
#Splitting the dataset to training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state= 0)
#Feature scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
#Fitting the classifier to training set
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
classifier.fit(X_train,y_train)
#Predicting the test results
y_pred = classifier.predict(X_test)
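Nothing in the snippet above is syntactically wrong; when predict() fails for k-NN, the usual cause is non-numeric or missing values in the selected columns, since KNeighborsClassifier needs a fully numeric, NaN-free matrix. A hedged check, run before the split (column indices copied from the question):
# An 'object' dtype or any NaN in these columns will make fit()/predict() fail.
cols = dataset.iloc[:, [1,2,3,5,6,7,8,10,11,12,13]]
print(cols.dtypes)        # every column should be numeric
print(cols.isna().sum())  # every count should be zero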
I am using the following code to check SGDClassifier:
import numpy as np
from sklearn.datasets import load_boston
from sklearn.linear_model import SGDClassifier
from sklearn.cross_validation import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.cross_validation import train_test_split
data = load_boston()
x_train, x_test, y_train, y_test = train_test_split(data.data, data.target)
x_scalar = StandardScaler()
y_scalar = StandardScaler()
x_train = x_scalar.fit_transform(x_train)
y_train = y_scalar.fit_transform(y_train)
x_test = x_scalar.transform(x_test)
y_test = y_scalar.transform(y_test)
regressor = SGDClassifier(loss='squared_loss')
scores = cross_val_score(regressor, x_train, y_train, cv=5)
print 'cross validation r scores ', scores
print 'average score ', np.mean(scores)
regressor.fit_transform(x_train, y_train)
print 'test set r score ', regressor.score(x_test,y_test)
However, when I run it I get deprecation warnings telling me to reshape, and the following ValueError:
ValueError Traceback (most recent call last)
<ipython-input-55-4d64d112f5db> in <module>()
18
19 regressor = SGDClassifier(loss='squared_loss')
---> 20 scores = cross_val_score(regressor, x_train, y_train, cv=5)
ValueError: Unknown label type: (array([ -1.89568750e+00, -1.75715217e+00, -1.68255622e+00,
-1.66124309e+00, -1.62927339e+00, -1.54402088e+00,
-1.49073806e+00, -1.41614211e+00, -1.40548554e+00,
-1.34154616e+00, -1.32023303e+00, -1.30957647e+00,
-1.27760677e+00, -1.26695021e+00, -1.25629365e+00,
-1.20301082e+00, -1.17104113e+00, -1.16038457e+00,....]),)
What could be the cause of this error in the code?
In classification tasks, the dependent variable (or target) is categorical; we try to predict, for example, whether a claim is fraudulent or not. In regression, on the other hand, the dependent variable is numerical: it can be measured.
In the Boston Housing dataset, the dependent variable is "Median value of owner-occupied homes in $1000's" (You can see the description by executing print(data.DESCR)). It is a continuous variable and cannot be predicted with a classifier.
If you want to test the classifier, you can use another dataset. For example, change load_boston() to load_iris(). Note that you also need to remove the scaling of the target variable; that transformation only makes sense for numerical targets. With these modifications, it should work correctly.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation has since been removed
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
data = load_iris()
x_train, x_test, y_train, y_test = train_test_split(data.data, data.target)
x_scalar = StandardScaler()
x_train = x_scalar.fit_transform(x_train)
x_test = x_scalar.transform(x_test)
classifier = SGDClassifier(loss='squared_error')  # spelled 'squared_loss' in older sklearn; an unusual loss for classification
scores = cross_val_score(classifier, x_train, y_train, cv=5)
scores
Out: array([ 0.33333333, 0.2173913 , 0.31818182, 0. , 0.19047619])
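Conversely, if the goal really was to predict Boston house prices, the regression counterpart is SGDRegressor. A minimal sketch, assuming the scaled x_train/x_test split from the question is still in scope (the scaled targets may need .ravel() to be 1-D):
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import cross_val_score
# SGDRegressor accepts the continuous target directly; for regressors,
# both score() and cross_val_score default to the R^2 metric.
regressor = SGDRegressor(loss='squared_error')
print(cross_val_score(regressor, x_train, y_train.ravel(), cv=5))
regressor.fit(x_train, y_train.ravel())
print('test set R^2:', regressor.score(x_test, y_test.ravel()))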