I have this code:
X, y = make_classification(n_features=2, n_redundant=0, n_samples=400, random_state=17)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=17)
clf = DecisionTree(max_depth=4, criterion='gini')
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
prob_pred = clf.predict_proba(X_test)
accuracy = accuracy_score(y_test, y_pred)
However, the last line, accuracy = accuracy_score(y_test, y_pred), raises the error "Expected array-like (array or non-string sequence), got None". How can I fix it?
Your code works well with one minor fix:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
X, y = make_classification(n_features=2, n_redundant=0, n_samples=400, random_state=17)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=17)
clf = DecisionTreeClassifier(max_depth=4, criterion='gini')  # Not DecisionTree
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
prob_pred = clf.predict_proba(X_test)
accuracy = accuracy_score(y_test, y_pred)
Output:
>>> accuracy
0.8833333333333333
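As a side note, prob_pred from predict_proba holds one probability column per class; a quick way to inspect it:
print(clf.classes_)     # column order of predict_proba, here [0 1]
print(prob_pred[:3])    # per-class probabilities for the first three test samples
print(prob_pred[:, 1])  # probability of class 1, handy for ROC curves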
The performance of the model does not increase across training epochs when the values are sorted by a specific row key. The dataset is balanced and has 40,000 records with binary classification (0, 1).
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
Linear_SVC_classifier = SVC(kernel='linear', random_state=1)  # supervised learning
Linear_SVC_classifier.fit(x_train, y_train)
SVC_Prediction = Linear_SVC_classifier.predict(x_test)  # predictions on the held-out test set
SVC_Accuracy = accuracy_score(y_test, SVC_Prediction)
print("\n\n\nLinear SVM Accuracy: ", SVC_Accuracy)
Add a CountVectorizer to your training data and use a logistic regression model:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
cv = CountVectorizer()
ctmTr = cv.fit_transform(X_train)
X_test_dtm = cv.transform(X_test)
model = LogisticRegression()
model.fit(ctmTr, y_train)
y_pred_class = model.predict(X_test_dtm)
accuracy = accuracy_score(y_test, y_pred_class)
print("\n\n\nLogistic Regression Accuracy: ", accuracy)
The model definition above is roughly 'equivalent' to this SVC statement:
Linear_SVC_classifier = SVC(kernel='linear', random_state=1)
Linear_SVC_classifier.fit(ctmTr, y_train)
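For reference, a sketch of the same idea wrapped in a Pipeline (assuming X_train and X_test hold raw text), which keeps the vectorizer and the classifier fitted together and avoids transforming the two sets separately:
from sklearn.pipeline import Pipeline
text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', LogisticRegression()),
])
text_clf.fit(X_train, y_train)  # fits the vectorizer and the model in one step
accuracy = accuracy_score(y_test, text_clf.predict(X_test))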
logistic_model = LogisticRegression()
logistic_model.fit(X_train, y_train)
y_pred = none
accuracy = accuracy_score(y_test, y_pred)
print (accuracy)
What should I put in the y_pred = none area? Is there anything wrong with my code?
Normally you would split your data into train and test sets; below is an example using make_classification:
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
X, y = make_classification()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
We can fit the model like you did:
logistic_model = LogisticRegression()
logistic_model.fit(X_train, y_train)
Get the predictions on the test set and score them against the actual values:
y_pred = logistic_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
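As a side note, for classifiers the estimator's score method computes the same accuracy directly:
print(accuracy)                              # exact value depends on the generated data
print(logistic_model.score(X_test, y_test))  # same number, computed in one call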
Here is my code, and it always returns 100% accuracy, regardless of how big the test size is. I used the train_test_split method, so I do not believe there should be any duplicates of data. Could someone inspect my code?
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
data = pd.read_csv('housing.csv')
prices = data['median_house_value']
features = data.drop(['median_house_value', 'ocean_proximity'], axis = 1)
prices.shape    # (20640,)
features.shape  # (20640, 8)
X_train, X_test, y_train, y_test = train_test_split(features, prices, test_size=0.2, random_state=42)
X_train = X_train.dropna()
y_train = y_train.dropna()
X_test = X_test.dropna()
y_test = X_test.dropna()
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
y_train.shape   # (16512,)
X_train.shape   # (16512, 8)
predictions = model.predict(X_test)
score = model.score(y_test, predictions)
score
EDIT: I have reworked my answer since I found multiple issues. Please copy-paste the below code to ensure no bugs are left.
Issues -
You are using DecisionTreeClassifier instead of DecisionTreeRegressor for a regression problem.
You are removing NaNs after doing the train-test split, which makes the sample counts inconsistent (and your y_test = X_test.dropna() line overwrites the labels with features). Do data.dropna() before the split.
You are calling model.score(y_test, predictions) incorrectly: score expects (X_test, y_test). Also, accuracy_score is a classification metric and raises an error on continuous targets, so for this regression use r2_score(y_test, predictions) or model.score(X_test, y_test).
import pandas as pd
from sklearn.tree import DecisionTreeRegressor #<---- FIRST ISSUE
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
data = pd.read_csv('housing.csv')
data = data.dropna() #<--- SECOND ISSUE
prices = data['median_house_value']
features = data.drop(['median_house_value', 'ocean_proximity'], axis = 1)
X_train, X_test, y_train, y_test = train_test_split(features, prices, test_size=0.2, random_state=42)
model = DecisionTreeRegressor()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
score = r2_score(y_test, predictions) #<----- THIRD ISSUE
score
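Optionally, an error measure in the target's units can be added as well (reusing y_test and predictions from above):
from sklearn.metrics import mean_squared_error
rmse = mean_squared_error(y_test, predictions) ** 0.5  # root mean squared error, in the units of median_house_value
print('R^2:', score)
print('RMSE:', rmse)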
I have a problem with my logistic regression function. I'm using the PyCharm IDE and LogisticRegression from the sklearn.linear_model package.
My debugger shows AttributeError: 'tuple' object has no attribute 'fit' (and likewise for 'predict').
Code below:
def logistic_regression(df, y):
    x_train, x_test, y_train, y_test = train_test_split(
        df, y, test_size=0.25, random_state=0)
    sc = StandardScaler()
    x_train = sc.fit_transform(x_train)
    x_test = sc.transform(x_test)
    clf = LogisticRegression(random_state=0, solver='sag',
                             penalty='l2', max_iter=1000, multi_class='multinomial'),
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    return classification_metrics.print_metrics(y_test, y_pred, 'Logistic regression')
Can anyone help spotting the mistake here? Because for other functions I tried fit and predict seem fine.
There is a small mistake in the code, as I mentioned in the comment: please remove the trailing comma after the LogisticRegression model object creation; that comma turns clf into a tuple, which is why fit and predict are missing. Also, there is no such function as classification_metrics.print_metrics, so I have used metrics.classification_report instead.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
def logistic_regression(df, y):
    x_train, x_test, y_train, y_test = train_test_split(df, y, test_size=0.25, random_state=0)
    sc = StandardScaler()
    x_train = sc.fit_transform(x_train)
    x_test = sc.transform(x_test)
    clf = LogisticRegression(random_state=0, solver='sag', penalty='l2', max_iter=1000, multi_class='multinomial')  # no trailing comma
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    return metrics.classification_report(y_test, y_pred)
Function call:
logistic_regression(df, y)
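One usage note: classification_report returns a plain string, so when running this as a script, print the returned value to see the table:
report = logistic_regression(df, y)
print(report)  # per-class precision, recall, f1-score and support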
I am trying logistic regression classification using k-fold cross-validation in Python.
My code:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import confusion_matrix, roc_auc_score

data = pd.read_csv('xxx.csv')
X = data[["a","b","c",...]]
y = data["Class"]

def get_predictions(clf, X_train, y_train, X_test):
    clf = clf
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    y_pred_prob = clf.predict_proba(X_test)
    train_pred = clf.predict(X_train)
    print('train-set confusion matrix:\n', confusion_matrix(y_train, train_pred))
    return y_pred, y_pred_prob

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
pred_test_full = 0
cv_score = []
i = 1
for train_index, test_index in skf.split(X, y):
    X_train, y_train = X.loc[train_index], y.loc[train_index]
    X_test, y_test = X.loc[test_index], y.loc[test_index]
    log_cfl = LogisticRegression(C=2)
    log_cfl.fit(X_train, y_train)
    y_pred, y_pred_prob = get_predictions(LogisticRegression(C=2), X_train, y_train, X_test)
    score = roc_auc_score(y_test, log_cfl.predict(X_test))
    print('ROC AUC score: ', score)
    cv_score.append(score)
    pred_test_full = pred_test_full + y_pred_prob
    i += 1
I get an error at this line of code:
pred_test_full = pred_test_full + y_pred_prob
The for loop runs twice, then on the third iteration I get this error:
'operands could not be broadcast together with shapes (56962,2) (5696..'
I couldn't understand what is wrong; could you help me figure it out?
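The likely cause: StratifiedKFold produces test folds of slightly different sizes whenever the number of samples is not divisible by n_splits, so the y_pred_prob arrays change shape between folds and cannot be added elementwise. A minimal sketch of the usual out-of-fold pattern (assuming the same X, y, and skf as above) preallocates one row per sample and writes each fold's probabilities into its own test rows:
pred_test_full = np.zeros((len(X), 2))  # one row per sample, one column per class
for train_index, test_index in skf.split(X, y):
    # iloc is used because skf.split returns positional indices
    X_train, y_train = X.iloc[train_index], y.iloc[train_index]
    X_test, y_test = X.iloc[test_index], y.iloc[test_index]
    log_cfl = LogisticRegression(C=2)
    log_cfl.fit(X_train, y_train)
    # each sample falls in exactly one test fold, so every row is filled exactly once
    pred_test_full[test_index] = log_cfl.predict_proba(X_test)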