My accuracy is at 0.0 and I don't know why? - python

I am getting an accuracy of 0.0. I am using the boston housing dataset.
Here is my code:
import sklearn
from sklearn import datasets
from sklearn import svm, metrics
from sklearn import linear_model, preprocessing
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
boston = datasets.load_boston()
x = boston.data
y = boston.target
train_data, test_data, train_label, test_label = sklearn.model_selection.train_test_split(x, y, test_size=0.2)
model = KNeighborsClassifier()
lab_enc = preprocessing.LabelEncoder()
train_label_encoded = lab_enc.fit_transform(train_label)
test_label_encoded = lab_enc.fit_transform(test_label)
model.fit(train_data, train_label_encoded)
predicted = model.predict(test_data)
accuracy = model.score(test_data, test_label_encoded)
print(accuracy)
How can I increase the accuracy on this dataset?

Boston dataset is for regression problems. Definition in the docs:
Load and return the boston house-prices dataset (regression).
So, it does not make sense if you use an ordinary encoding like the labels are not samples from a continuous data. For example, you encode 12.3 and 12.4 to completely different labels but they are pretty close to each other, and you evaluate the result wrong if the classifier predicts 12.4 when the real target is 12.3, but this is not a binary situation. In classification, the prediction is whether correct or not, but in regression it is calculated in a different way such as mean square error.
This part is not necessary, but I would like to give you an example for the same dataset and source code. With a simple idea of rounding the labels towards zero(to the nearest integer to zero) will give you some intuition.
5.0-5.9 -> 5
6.0-6.9 -> 6
...
50.0-50.9 -> 50
Let's change your code a little bit.
import numpy as np
def encode_func(labels):
return np.array([int(l) for l in labels])
...
train_label_encoded = encode_func(train_label)
test_label_encoded = encode_func(test_label)
The output will be around 10%.

Related

how to build an artificial neural network with scikit-learn?

I am trying to run an artificial neural network with scikit-learn.
I want to run the regression, get the model fit results, an generate out of sample forecasts.
This is my code below. Any help will be greatly appreciated.
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
from sklearn.neural_network import MLPClassifier
from sklearn.neural_network import MLPRegressor
#import the data
df=pd.read_excel(r"C:\Users\Action\Downloads\Python\Practice_Data\sorted_data v2.xlsx")
#view the data
df.head(5)
#to drop a column of data type
df2=df.drop('Unnamed: 13', axis=1)
#view the data
df2.head(5)
Import necessary modules
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from math import sqrt
from sklearn.metrics import r2_score
describe the data
df.describe().transpose()
target_column = ['public health care services']
predictors = list(set(list(df.columns))-set(target_column))
df[predictors] = df[predictors]/df[predictors].max()
df.describe().transpose()
set the X and Y
X = df[predictors].values
y = df[target_column].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=40)
print(X_train.shape); print(X_test.shape)
import MLP Classifier and fit the network
from sklearn.neural_network import MLPClassifier
mlp = MLPClassifier(hidden_layer_sizes=(8,8,8), activation='relu', solver='adam', max_iter=500)
mlp.fit(X_train,y_train)
predict_train = mlp.predict(X_train)
set up the MLP Classifier
mlp = MLPClassifier(
hidden_layer_sizes=(50, 8),
max_iter=15,
alpha=1e-4,
solver="sgd",
verbose=True,
random_state=1,
learning_rate_init=0.1)
import the warnings
import warnings
from sklearn.exceptions import ConvergenceWarning
warnings.filterwarnings("ignore", category=ConvergenceWarning, module="sklearn")
predict_test = mlp.predict(X_test)
to train on the data I use the MLPClassifier to call the fit function on the training data.
mlp.fit(X_train, y_train)
after this, the neural network is done training.
after the neural network is trained, the next step is to test it.
print out the model scores
print(f"Training set score: {mlp.score(X_train, y_train)}")
print(f"Test set score: {mlp.score(X_test, y_test)}")
y_predict = mlp.predict(X_train)
I am getting an error from below
x_ann = y_predict[:, 0]
y_ann = y_predict[:, 1]
The error message is
IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed
any help will be greatly appreciated
predict function gives you the actual class and since your point can belong to one and only one class (except multi label), it is supposed to be like this only
What is the shape of your Y_true_labels? Might be the case that your labels are Sparse and with 2 classes, means 0,1 and since the models is minimising Log Loss as described here as:
This model optimizes the log-loss function using LBFGS or stochastic gradient descent.
Also looking at the predict() it says:
log_y_probndarray of shape (n_samples, n_classes)
The predicted log-probability of the sample for each class in the model, where classes are ordered as they are in self.classes_. Equivalent to log(predict_proba(X))
So it means that if probability is 0.3 it means it belongs to class and if it's 0.7 it belongs to class, ASSUMING it's binary classification with a threshold set to 0.5.
What you might be confusing with is the predict_proba() function which gives you the probabilities for each classes.
Might be the case. Please post your X,Y data shape and type so that we can understand better.

sklearn model returns a mean absolute error of 0, why?

Toying around with sklearn and I wanted to predict TSLA Close prices for a few dates using the Open, High, Low prices and the Volume. I used a very basic model to predict the close and they were supposedly 100% accurate and I'm not sure why. The 0% error makes me feel as if I didn't set up my model correctly.
Code:
from os import X_OK
from numpy.lib.shape_base import apply_along_axis
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error
tsla_data_path = "/Users/simon/Documents/PythonVS/ML/TSLA.csv"
tsla_data = pd.read_csv(tsla_data_path)
tsla_features = ['Open','High','Low','Volume']
y = tsla_data.Close
X = tsla_data[tsla_features]
# define model
tesla_model = DecisionTreeRegressor(random_state = 1)
# fit model
tesla_model.fit(X,y)
#print results
print('making predictions for the following five dates')
print(X.head())
print('________________________________________________')
print('the predictions are')
print(tesla_model.predict(X.head()))
print('________________________________________________')
print('the error is ')
print(mean_absolute_error(y.head(),tesla_model.predict(X.head())))
Output:
making predictions for the following five dates
Open High Low Volume
0 67.054001 67.099998 65.419998 39737000
1 66.223999 66.786003 65.713997 27778000
2 66.222000 66.251999 65.500000 12328000
3 65.879997 67.276001 65.737999 30372500
4 66.524002 67.582001 66.438004 32868500
________________________________________________
the predictions are
[65.783997 66.258003 65.987999 66.973999 67.239998]
________________________________________________
the error is
0.0
Data:
Date,Open,High,Low,Close,Adj_Close,Volume
2019-11-26,67.054001,67.099998,65.419998,65.783997,65.783997,39737000
2019-11-27,66.223999,66.786003,65.713997,66.258003,66.258003,27778000
2019-11-29,66.222000,66.251999,65.500000,65.987999,65.987999,12328000
2019-12-02,65.879997,67.276001,65.737999,66.973999,66.973999,30372500
2019-12-03,66.524002,67.582001,66.438004,67.239998,67.239998,32868500
You're making a mistake by measuring the performance of your model on the dataset used to train it.
If you want to have a proper evaluation metric of your performance, you should split your dataset into 2 datasets. One that is used to train the model and the other to measure its performance. You can split your dataset using sklearn.model_selection.train_test_split() as follow:
tesla_model = DecisionTreeRegressor(random_state = 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
tesla_model.fit(X_train, X_test)
mae = mean_absolute_error(y_test,tesla_model.predict(X_test))
Have a look at this Wikipedia page explaining the differents dataset in ML.

Mean squared error is enormous when using Scikit Learn

I have been battling this problem with my MSE while predicting with regression. I have encountered the same problem with different regression models I have tried to build.
The problem is, my MSE is humongous. 83661743.99 to be exact. My R squared is 0.91 which does not seem problematic.
I manually implemented the cost function and gradient descent while doing the coursework in Andrew Ng's Stanford ML classes and I have a reasonable cost function; but when I try to implement it with SKLearn lib the MSE is something else. I don't know what I have done wrong and I need some help checking it out.
Here is the link to the dataset I used: https://www.kaggle.com/farhanmd29/50-startups
My code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
df = pd.read_csv('50_Startups.csv')
#checking the level of correlations between the predictors and response
sns.heatmap(df.corr(), annot=True)
#Splitting the predictors from the response
X = df.iloc[:,:-1].values
y = df.iloc[:,4].values
#Encoding the Categorical values
label_encoder_X = LabelEncoder()
X[:,3] = label_encoder_X.fit_transform(X[:,3])
#Feature Scaling
scaler = StandardScaler()
X = scaler.fit_transform(X)
#splitting train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)
#Linear Regression
model = LinearRegression()
model.fit(X_train,y_train)
pred = model.predict(X_test)
#Cost Function
mse = mean_squared_error(y_test,pred)
mse
As you used standard normalization for scaling, the values of the dataset can be humongous. As desertnaut said, MSE is not scaled so it can be huge due to the big values of the dataset. You can try to normalize data using a MinMaxScaler to get the iput between [0-1]
I have gotten to understand the error of my ways. The MSE is 1/n (No of Samples) multiplied by the summation of the actual response subtracted by the predicted response SQUARED. Hence the error given will be SQUARED the expected error value. what I should have looked out for was the RMSE which will find the sqrt of the MSE. my predictions were off as well and that was because I scaled my values. Un-scaled X values gave me much better predictions. This I will have to look into more as I do not understand why.

ROC curve for multi-class classification without one vs all in python

I have a multi-class classification problem with 9 different classes. I am using the AdaBoostClassifier class from scikit-learn to train my model without using the one vs all technique, as the number of classes is very high and it might be inefficient.
I have tried using the tips from the documentation in scikit learn [1], but there the one vs all technique is used, which is substantially different. In my approach I only get one prediction per event, i.e. if I have n classes, the outcome of the prediction is a single value within the n classes. For the one vs all approach, on the other hand, the outcome of the prediction is an array of size n with a sort of likelihood value per class.
[1]
https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html#sphx-glr-auto-examples-model-selection-plot-roc-py
The code is:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt # Matplotlib plotting library for basic visualisation
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_curve, auc
from sklearn import preprocessing
# Read data
df = pd.read_pickle('data.pkl')
# Create the dependent variable class
# This will substitute each of the n classes from
# text to number
factor = pd.factorize(df['target_var'])
df.target_var= factor[0]
definitions = factor[1]
X = df.drop('target_var', axis=1)
y = df['target_var]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)
bdt_clf = AdaBoostClassifier(
DecisionTreeClassifier(max_depth=2),
n_estimators=250,
learning_rate=0.3)
bdt_clf.fit(X_train, y_train)
y_pred = bdt_clf.predict(X_test)
#Reverse factorize (converting y_pred from 0s,1s, 2s, etc. to their original values
reversefactor = dict(zip(range(9),definitions))
y_test_rev = np.vectorize(reversefactor.get)(y_test)
y_pred_rev = np.vectorize(reversefactor.get)(y_pred)
I tried directly with the roc curve function, and also binarising the labels, but I always get the same error message.
def multiclass_roc_auc(y_test, y_pred):
lb = preprocessing.LabelBinarizer()
lb.fit(y_test)
y_test = lb.transform(y_test)
y_pred = lb.transform(y_pred)
return roc_curve(y_test, y_pred)
multiclass_roc_auc(y_test, y_pred_test)
The error message is:
ValueError: multilabel-indicator format is not supported
How could this be sorted out? Am I missing some important concept?

Why is Multi Class Machine Learning Model Giving Bad Results?

I have the following code so far:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from sklearn import preprocessing
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
df_train = pd.read_csv('uc_data_train.csv')
del df_train['Unnamed: 0']
temp = df_train['size_womenswear']
del df_train['size_womenswear']
df_train['size_womenswear'] = temp
df_train['count'] = 1
print(df_train.head())
print(df_train.dtypes)
print(df_train[['size_womenswear', 'count']].groupby('size_womenswear').count()) # Determine number of unique catagories, and number of cases for each catagory
del df_train['count']
df_test = pd.read_csv('uc_data_test.csv')
del df_test['Unnamed: 0']
print(df_test.head())
print(df_test.dtypes)
df_train.drop(['customer_id','socioeconomic_status','brand','socioeconomic_desc','order_method',
'first_order_channel','days_since_first_order','total_number_of_orders', 'return_rate'], axis=1, inplace=True)
LE = preprocessing.LabelEncoder() # Create label encoder
df_train['size_womenswear'] = LE.fit_transform(np.ravel(df_train[['size_womenswear']]))
print(df_train.head())
print(df_train.dtypes)
x = df_train.iloc[:,np.arange(len(df_train.columns)-1)].values # Assign independent values
y = df_train.iloc[:,-1].values # and dependent values
xTrain, xTest, yTrain, yTest = train_test_split(x, y, test_size = 0.25, random_state = 0) # Testing on 75% of the data
model = GaussianNB()
model.fit(xTrain, yTrain)
yPredicted = model.predict(xTest)
#print(yPrediction)
print('Accuracy: ', accuracy_score(yTest, yPredicted))
I am not sure how to include the data that I am using but I am trying to predict the 'size_womenswear'. There are 8 different sizes that I have encoded to predict and I have moved this column to the end of the dataframe. so y is the dependent and x are the independent (all the other columns)
I am using a Gaussian Naive Bayes classifier to try and classify the 8 different sizes and then test on 25% of the data. The results are not very good.
I don't know why I am only getting an accuracy of 61% when I am working with 80,000 rows. I am very new to Machine Learning and would appreciate any assistance. Is there a better method that I could use in this case than Gaussian Naive Bayes?
can't comment, just throwing out some ideas;
Maybe you need to deal with class imbalance, and try other model that will fit the data better? try the xgboost or lightgbm package given good data they usually perform pretty good in general, but it really depends on the data.
Also the way you split train and test, does the resulting train and test data set has similar distribution for your Y? that's very important.
Last thing, for classification models the performance measurement can be a bit tricky, try some other measurement methods. F1 scores or try to draw a confusion matrix and see what your predictions vs Y looks like. perhaps your model is predicting everything to one
or just a few classes.

Categories