Unexpected behavior of ML Logistic Regression? - python

I am new to AI and ML so apologies if this is a stupid question.
I was reading about Logistic Regression and found out that it is a supervised classification ML model.
So I tried to code an example to give it a try. My idea was to see whether the program could figure out the "rule" behind the label (Y) I established, which is: "Y = 1 if and only if exactly one of X1 and X2 is a multiple of 3 (one but not both), 0 otherwise".
But as you can see, the accuracy is very poor. Am I doing something wrong? Did I misunderstand the concept of Logistic Regression?
DATASET:
3,1,1
2,3,1
1,1,0
2,4,0
5,6,1
9,3,1
8,9,1
5,5,0
9,9,0
5,7,0
3,3,0
5,3,1
2,4,0
7,7,0
4,9,1
7,3,1
6,2,1
8,1,0
6,4,0
9,4,1
CODE:
from sklearn.linear_model import LogisticRegression
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import metrics
col_names = ['x1', 'x2', 'y']
multi3 = pd.read_csv("1.csv", header=None, names=col_names)
feature_cols = ['x1', 'x2']
X = multi3[feature_cols]
y = multi3.y
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
print(cnf_matrix)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
print("Precision:", metrics.precision_score(y_test, y_pred))
print("Recall:", metrics.recall_score(y_test, y_pred))
OUTPUT:
[[1 2]
 [1 1]]
Accuracy: 0.4
Precision: 0.3333333333333333
Recall: 0.5
EDIT:
Source code of my comment below.

You can visualize your data:
multi3.plot.scatter(x = "x1",y="x2", c = "y",cmap="viridis")
You can see there is no clear separation between your two classes (0 and 1). So the accuracy you get, even allowing for the small test set, will be low, because x1 and x2 on their own are not useful for discriminating the labels.
In the code you posted in your comment, you used a larger dataset with simulated data; if we do something similar:
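As a quick check of that (a minimal sketch, assuming the exact rule stated in the question; the modulo-3 indicator features below are my own illustration, not from the original post), you can generate a larger dataset from the rule and see that plain logistic regression on the raw x1 and x2 stays near the majority-class baseline, while features that make the rule linear let the same model fit it:

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

np.random.seed(0)
# simulated data following the stated rule:
# y = 1 iff exactly one of x1, x2 is a multiple of 3
sim = pd.DataFrame(np.random.randint(1, 10, (500, 2)), columns=['x1', 'x2'])
sim['y'] = ((sim['x1'] % 3 == 0) ^ (sim['x2'] % 3 == 0)).astype(int)

logreg = LogisticRegression()
logreg.fit(sim[['x1', 'x2']], sim['y'])
print(logreg.score(sim[['x1', 'x2']], sim['y']))  # barely better than always predicting the majority class

# with modulo-3 indicators and their interaction, the rule becomes linear
sim['m1'] = (sim['x1'] % 3 == 0).astype(int)
sim['m2'] = (sim['x2'] % 3 == 0).astype(int)
sim['m1m2'] = sim['m1'] * sim['m2']
logreg.fit(sim[['m1', 'm2', 'm1m2']], sim['y'])
print(logreg.score(sim[['m1', 'm2', 'm1m2']], sim['y']))  # should be 1.0 (or very close)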
import numpy as np
np.random.seed(123)
df = pd.DataFrame(np.random.randint(0,10,(60,2)),columns=['x1', 'x2'])
df['y'] = ((df['x1']>5) & (df['x2'] > 5)).astype(int)
logreg = LogisticRegression()
logreg.fit(df[['x1','x2']], df['y'])
y_pred = logreg.predict(df[['x1','x2']])
cnf_matrix = metrics.confusion_matrix(df['y'], y_pred)
cnf_matrix
array([[49,  2],
       [ 2,  7]])
And of course, you can see that there is separation:
My guess is that the original dataset is wrong or has nothing to do with what you posted as an image.
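A quick way to check that guess (a small sketch against the dataset exactly as posted) is to recompute the label from the stated rule and compare it with the y column; for example the row 9,3,1 has both values as multiples of 3, which does not match the rule:

# recompute the label from the stated rule and compare with the posted y
expected = ((multi3['x1'] % 3 == 0) ^ (multi3['x2'] % 3 == 0)).astype(int)
print(multi3[expected != multi3['y']])  # rows whose label disagrees with the rule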

SVR/SVM output predictions are very similar to each other but far from true value

The main idea is to predict 2 target outputs based on the input features.
The input features are already scaled using StandardScaler() from sklearn.
The size of X_train is (190 x 6) and Y_train is (190 x 2); X_test is (20 x 6) and Y_test is (20 x 2).
Both the linear and rbf kernels use GridSearchCV to find the best C (linear) and the best C and gamma (rbf).
[PROBLEM] I perform SVR using MultiOutputRegressor with both the linear and rbf kernels, but the predicted outputs are very similar to each other (not exactly a constant prediction) and pretty far from the true values of Y.
Below are the plots, where the scatter points represent the true values of Y. The first picture corresponds to the first target, Y[:,0], and the second picture to the second target, Y[:,1].
Do I have to scale my target output? Is there any other model that could help improve test accuracy?
I have tried a random forest regressor and tuned it as well, and its test accuracy is about the same as what I'm getting with SVR. (The results below are from SVR.)
Best parameter: {'estimator__C': 1}
MAE: [18.51151192 9.604601 ] #from linear kernel
Best parameter (rbf): {'estimator__C': 1, 'estimator__gamma': 1e-09}
MAE (rbf): [17.80482033 9.39780134] #from rbf kernel
Thank you so much! Any help and input is greatly appreciated!! ^__^
---------------- Code -----------------------------
import numpy as np
from numpy import load
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.multioutput import MultiOutputRegressor
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import RepeatedKFold
rkf = RepeatedKFold(n_splits=5, n_repeats=3)
#input features - HR, HRV, PTT, breathing_rate, LASI, AI
X = load('200_patient_input_scaled.npy')
#Output features - SBP, DBP
Y = load('200_patient_output_raw.npy')
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.095, random_state = 43)
epsilon = 0.1
#--------------------------- Linear SVR kernel Model ------------------------------------------------------
linear_svr = SVR(kernel='linear', epsilon = epsilon)
multi_output_linear_svr = MultiOutputRegressor(linear_svr)
#multi_output_linear_svr.fit(X_train, Y_train) #just to see the output
#GridSearch - find the best C
grid = {'estimator__C': [1,10,10,100,1000] }
grid_linear_svr = GridSearchCV(multi_output_linear_svr, grid, scoring='neg_mean_absolute_error', cv=rkf, refit=True)
grid_linear_svr.fit(X_train, Y_train)
#Prediction
Y_predict = grid_linear_svr.predict(X_test)
print("\nBest parameter:", grid_linear_svr.best_params_ )
print("MAE:", mean_absolute_error(Y_predict,Y_test, multioutput='raw_values'))
#-------------------------- RBF SVR kernel Model --------------------------------------------------------
rbf_svr = SVR(kernel='rbf', epsilon = epsilon)
multi_output_rbf_svr = MultiOutputRegressor(rbf_svr)
#Grid search - Find best combination of C and gamma
grid_rbf = {'estimator__C': [1,10,10,100,1000], 'estimator__gamma': [1e-9, 1e-8, 1e-7, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2] }
grid_rbf_svr = GridSearchCV(multi_output_rbf_svr, grid_rbf, scoring='neg_mean_absolute_error', cv=rkf, refit=True)
grid_rbf_svr.fit(X_train, Y_train)
#Prediction
Y_predict_rbf = grid_rbf_svr.predict(X_test)
print("\nBest parameter (rbf):", grid_rbf_svr.best_params_ )
print("MAE (rbf):", mean_absolute_error(Y_predict_rbf,Y_test, multioutput='raw_values'))
#Plotting
plot_y_predict = Y_predict_rbf[:,1]
plt.scatter( np.linspace(0, 20, num = 20), Y_test[:,1], color = 'red')
plt.plot(np.linspace(0, 20, num = 20), plot_y_predict)
A common mistake when people use StandardScaler is applying it along the wrong axis of the data: scaling the whole array at once, or row by row instead of column by column. Please make sure you've done this right! I would do it by hand to be sure, because otherwise I think it needs a different StandardScaler fit for each feature.
[RESPONSE/EDIT]: I think that just negates what StandardScaler did by inverting the transformation. I'm not entirely sure of StandardScaler's behaviour; I'm just saying all this from experience and from having had trouble scaling multi-feature data. If I were you (for example, for min-max scaling), I would prefer something like this:
columnsX = X.shape[1]
for i in range(columnsX):
    X[:, i] = (X[:, i] - X[:, i].min()) / (X[:, i].max() - X[:, i].min())
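For what it's worth, sklearn's StandardScaler already fits one mean and standard deviation per column when given a 2-D array, so you can sanity-check the axis by comparing its output with a manual per-column standardisation (a small sketch on random data of the same shape as X_train in the question, not the real data):

import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(190, 6))  # same shape as X_train in the question

scaled = StandardScaler().fit_transform(X_demo)

# manual per-column standardisation for comparison
manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)
print(np.allclose(scaled, manual))  # True -> StandardScaler scales per feature (column)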

Cannot use Mean Squared Logarithmic Error (negative predicted values) though normalized data and predictions > -1

I am trying to implement a simple sklearn.linear_model.LinearRegression model and evaluate its performance through MSLE.
MSLE is based on SLE = (log(prediction + 1) - log(actual + 1))^2.
I have something like 15 features, which are all normalized or standardized and all positive.
But when I try to do a cross-validation on my training data:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
lin_reg = LinearRegression()
linreg_scores = cross_val_score(lin_reg, X_train, y_train, cv = 10, scoring = 'neg_mean_squared_log_error')
I get the following error:
ValueError: Mean Squared Logarithmic Error cannot be used when targets contain negative values.
So I checked by hand doing a manual cross validation with sklearn.model_selection.KFold, in order to print the predicted values for each fold...
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_log_error
from sklearn.base import clone
kf = KFold(n_splits=5, shuffle=True, random_state=5)
lin_reg = LinearRegression()
split_count = 0
for train_index, val_index in kf.split(X_train, y_train):
    split_count += 1
    clone_reg = clone(lin_reg)
    X_tr = X_train.loc[train_index, :]
    X_val = X_train.loc[val_index, :]
    y_tr = y_train.loc[train_index]
    y_val = y_train.loc[val_index]
    clone_reg.fit(X_tr, y_tr)
    pred = clone_reg.predict(X_val)
    if any(pred < 0):
        print(split_count)
        print(pred[pred < 0])
The thing is, I do get negative predicted values, but they are all between [-1, 0]:
1
[-0.08642619]
3
[-0.2426673]
5
[-0.51744243]
So according to the MSLE formula, (y_predict + 1) is still positive, and therefore log(y_predict + 1) is mathematically well-defined.
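Indeed, that step checks out numerically: np.log1p(x) = log(1 + x) is defined for any x > -1, so a prediction like -0.086 is not a problem for the formula itself. A tiny sketch (the actual value here is made up, just for illustration):

import numpy as np

pred = np.array([-0.08642619])  # one of the negative predictions printed above
actual = np.array([0.5])        # hypothetical true value, just for illustration
sle = (np.log1p(pred) - np.log1p(actual)) ** 2
print(sle)                      # a finite, well-defined number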
Is there something that I am missing here?
Thanks a lot for your help, I'll obviously provide any additional info if needed!

How to make logistic regression with fashion-MNIST dataset

I don't know what to do next. I have tried many times, but my teacher wants me to make 10 models.
import pandas as pd
from numpy import reshape
from sklearn import metrics
train = pd.read_csv('fashion_train.csv',header =None)
print(train.head())
label = train[0]
test = pd.read_csv('fashion_test.csv',header = None)
print(test.head())
labelT = test[0]
print(labelT)
X_train = train.iloc[:, 1:]
print(X_train)
y_train = train.iloc[:, 0]
print(y_train)
X_test = test.iloc[:, 1:]
y_test = test.iloc[:, 0]
from sklearn.linear_model import LogisticRegression
Okay, so what you need to do now is train your model. Just add the following lines to your code:
model = LogisticRegression()
model.fit(X_train, y_train)
prediction = model.predict(X_test)
Now you can use any accuracy calculator. For example:
score = metrics.classification_report(y_test, prediction)
print(score)
Note that this uses the from sklearn import metrics line you already have; if you'd rather write sklearn.metrics.classification_report, you need import sklearn.metrics instead.
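If the "10 models" your teacher asks for means one binary classifier per class (just a guess at the assignment), a minimal one-vs-rest sketch on top of the same X_train / y_train could look like this (max_iter=1000 is only there to help convergence on the raw pixel features):

from sklearn.linear_model import LogisticRegression

models = {}
for digit in sorted(y_train.unique()):           # the 10 Fashion-MNIST classes (0-9)
    y_binary = (y_train == digit).astype(int)    # 1 for this class, 0 for all the others
    m = LogisticRegression(max_iter=1000)
    m.fit(X_train, y_binary)
    models[digit] = m

# per-class accuracy of each one-vs-rest model on the test set
for digit, m in models.items():
    print(digit, m.score(X_test, (y_test == digit).astype(int)))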

KNN model, accuracy(clf.score) returns 0

I am working on a simple KNN model with 3NN to predict a weight.
However, the accuracy is 0.0 and I don't know why.
The code does give me a weight prediction (58 or 59).
This is the reproducible code
import numpy as np
from sklearn import preprocessing, neighbors
from sklearn.model_selection import train_test_split
import pandas as pd
from sklearn.metrics import accuracy_score
#Create df
data = {"ID":[i for i in range(1,11)],
"Height":[5,5.11,5.6,5.9,4.8,5.8,5.3,5.8,5.5,5.6],
"Age":[45,26,30,34,40,36,19,28,23,32],
"Weight": [77,47,55,59,72,60,40,60,45,58]
}
df = pd.DataFrame(data, columns = [x for x in data.keys()])
print("This is the original df:")
print(df)
#Feature Engineering
df.drop(["ID"], 1, inplace = True)
X = np.array(df.drop(["Weight"],1))
y = np.array(df["Weight"])
#define training and testing
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size =0.2)
#Build clf with n =3
clf = neighbors.KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)
#accuracy
accuracy = clf.score(X_test, y_test)
print("\n accruacy = ", accuracy)
#Prediction on 11th
ans = np.array([5.5,38])
ans = ans.reshape(1,-1)
prediction = clf.predict(ans)
print("\nThis is the ans: ", prediction)
You are classifying Weight, which is a continuous (not a discrete) variable. This should be a regression rather than a classification. Try KNeighborsRegressor.
To evaluate your result, use metrics for regression such as R2 score.
If your score is low, that can mean different things: training set too small, test set too different from training set, regression model not adequate...
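A minimal sketch of that change on the same data (assuming the X_train / X_test / y_train / y_test variables from the question are still defined):

import numpy as np
from sklearn import neighbors

# same split as above, but with a regressor instead of a classifier
reg = neighbors.KNeighborsRegressor(n_neighbors=3)
reg.fit(X_train, y_train)

# for a regressor, score() returns the R2 coefficient of determination, not accuracy
print("R2 on the test set:", reg.score(X_test, y_test))
print("Predicted weight for [5.5, 38]:", reg.predict(np.array([[5.5, 38]])))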

Logistic Regression - Machine Learning

Logistic Regression with inputs of "Machine Learning.csv" file.
#Import Libraries
import pandas as pd
#Import Dataset
dataset = pd.read_csv('Machine Learning Data Set.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 10]
#Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2, random_state = 0)
#Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
#Fitting Logistic Regression to the Training Set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state=0)
classifier.fit(X_train,y_train)
#Predicting the Test set results
y_pred = classifier.predict(X_test)
#Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test,y_pred)
I have machine learning / logistic regression code (Python) as above. It has properly trained my model and gives a really good match with the test data. But unfortunately it only gives me 0/1 (binary) results when I test it with some other random values. (The training set has only 0/1 - as in failed/succeeded.)
How can I get a probability result instead of a binary result from this algorithm? I have tried a very different set of numbers and would like to find out the probability of failing - instead of a 0 or 1.
Any help is strongly appreciated :) Thanks a lot!
Just replace
y_pred = classifier.predict(X_test)
with
y_pred = classifier.predict_proba(X_test)
For details, refer to Logistic Regression Probability.
predict_proba(X_test) will give you the probability of each sample for each class, i.e. if X_test contains n_samples and you have 2 classes, the output will be an "n_samples x 2" matrix, and the two predicted class probabilities for each sample will sum to 1. For more details, have a look at the documentation here.
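If you specifically want the probability of failing, you can pick the column of predict_proba that corresponds to the "failed" label; classifier.classes_ tells you the column order (a small sketch, assuming failed is encoded as 0 as in the question):

proba = classifier.predict_proba(X_test)   # shape: (n_samples, 2)
print(classifier.classes_)                 # column order of proba, e.g. [0 1]

# probability of the "failed" class, assumed here to be label 0
fail_col = list(classifier.classes_).index(0)
p_fail = proba[:, fail_col]
print(p_fail)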
