Here is my code; it always returns 100% accuracy, regardless of how big the test size is. I used the train_test_split method, so I do not believe there should be any duplicated data between the sets. Could someone inspect my code?
from sklearn.tree import DecisionTreeClassifier
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd
data = pd.read_csv('housing.csv')
prices = data['median_house_value']
features = data.drop(['median_house_value', 'ocean_proximity'], axis = 1)
prices.shape    # (20640,)
features.shape  # (20640, 8)
X_train, X_test, y_train, y_test = train_test_split(features, prices, test_size=0.2, random_state=42)
X_train = X_train.dropna()
y_train = y_train.dropna()
X_test = X_test.dropna()
y_test = X_test.dropna()
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
y_train.shape  # (16512,)
X_train.shape  # (16512, 8)
predictions = model.predict(X_test)
score = model.score(y_test, predictions)
score
EDIT: I have reworked my answer after finding multiple issues. Copy-paste the code below to make sure no bugs are left.
Issues:
You are using DecisionTreeClassifier instead of DecisionTreeRegressor for a regression problem.
You are dropping NaNs after the train/test split, which throws X and y out of alignment because rows are removed from each independently. Call data.dropna() once, before the split.
You are calling model.score() with the wrong arguments: it expects (X_test, y_test), not (y_test, predictions). Worse, your code assigns X_test.dropna() to y_test, so model.score(y_test, predictions) predicts on X_test and compares the result against those same predictions, which trivially match; that is why you always see 100% accuracy. Also, accuracy is not defined for a regression target, so use a regression metric such as r2_score(y_test, predictions).
from sklearn.tree import DecisionTreeRegressor  # <---- FIRST ISSUE
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score  # <---- a regression metric, not accuracy_score
data = pd.read_csv('housing.csv')
data = data.dropna()  # <---- SECOND ISSUE: drop NaNs before splitting
prices = data['median_house_value']
features = data.drop(['median_house_value', 'ocean_proximity'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(features, prices, test_size=0.2, random_state=42)
model = DecisionTreeRegressor()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
score = r2_score(y_test, predictions)  # <---- THIRD ISSUE: score the predictions against y_test
score
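If you also want an error in the target's own units, RMSE is a common companion to R². A minimal sketch, assuming the y_test, predictions, and score variables from the corrected code above:
import numpy as np
from sklearn.metrics import mean_squared_error
# RMSE is the square root of the mean squared error, in the same units as the target
rmse = np.sqrt(mean_squared_error(y_test, predictions))
print("R^2:", score, "RMSE:", rmse)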
Related
The performance of the model does not increase across training epochs when the values are sorted by a specific row key. The dataset is balanced and has 40,000 records with binary classification (0, 1).
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
Linear_SVC_classifier = SVC(kernel='linear', random_state=1)  # supervised learning
Linear_SVC_classifier.fit(x_train, y_train)
SVC_Prediction = Linear_SVC_classifier.predict(x_test)  # predict on the held-out test set
SVC_Accuracy = accuracy_score(y_test, SVC_Prediction)
print("\n\n\nLinear SVM Accuracy: ", SVC_Accuracy)
Add a CountVectorizer to your training data and use a logistic regression model:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
cv = CountVectorizer()
ctmTr = cv.fit_transform(X_train)
X_test_dtm = cv.transform(X_test)
model = LogisticRegression()
model.fit(ctmTr, y_train)
y_pred_class = model.predict(X_test_dtm)
SVC_Accuracy = accuracy_score(y_test, y_pred_class)
print("\n\n\nLogistic Regression Accuracy: ", SVC_Accuracy)
The model definition above is roughly 'equivalent' to this statement:
Linear_SVC_classifier = SVC(kernel='linear', random_state=1)
Linear_SVC_classifier.fit(ctmTr, y_train)
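If juggling the vectorizer and the classifier separately gets error-prone, a Pipeline bundles them so the vectorizer is fit only on the training fold. A minimal sketch, assuming the same raw-text X_train/X_test and labels as above:
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# the pipeline fits the vectorizer on X_train and reuses it for X_test
pipe = make_pipeline(CountVectorizer(), LogisticRegression())
pipe.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, pipe.predict(X_test)))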
logistic_model = LogisticRegression()
logistic_model.fit(X_train, y_train)
y_pred = None
accuracy = accuracy_score(y_test, y_pred)
print (accuracy)
What should I put in place of y_pred = None? Is there anything wrong with my code?
Normally you would split your data into train and test sets; below is an example using a synthetic dataset from make_classification:
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
X, y = make_classification()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
We can fit the model like you did:
logistic_model = LogisticRegression()
logistic_model.fit(X_train, y_train)
Get the predictions on the test set and score them against the actual values:
y_pred = logistic_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
I built a linear model with sklearn based on the Cement and Concrete Composites dataset.
Initially, I used train_test_split(X, Y, test_size=0.3, shuffle=False) and found the train and test error.
Now I want to run the same model 10 times with shuffle=True, compute the mean and standard deviation of the errors, and compare the new results to the first ones.
How could I loop the same model n times and save the errors in a list?
Try something like this:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
errors = []
for i in range(10):
    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, shuffle=True)
    model = LinearRegression()  # the model you want to use here
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    error = mean_squared_error(y_test, y_pred)  # the error metric you want to use here
    errors.append(error)
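To get the mean and standard deviation the question asks for, summarize the list at the end. A small follow-up using the errors list above:
import numpy as np
# average error and spread across the 10 shuffled splits
print("mean error:", np.mean(errors))
print("std of error:", np.std(errors))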
What you need is cross-validation: repeated evaluation of the model on different splits of the same data. train_test_split in this case is a wrapper around ShuffleSplit cross-validation.
In your case it might look like this:
from sklearn.model_selection import ShuffleSplit, cross_val_score
import numpy as np
from sklearn.linear_model import LinearRegression
X, y = ... # read dataset
model = LinearRegression()
# n_splits=10 is for 10 random shuffled train-test splits
cv = ShuffleSplit(n_splits=10, test_size=.3, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring='neg_mean_squared_error')
np.mean(scores), np.std(scores)
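Note that scikit-learn negates MSE with this scorer so that higher is always better; to report plain MSE (and RMSE), flip the sign first. A small follow-up using the scores array above:
mse = -scores  # undo the negation applied by 'neg_mean_squared_error'
print("mean MSE:", np.mean(mse), "std:", np.std(mse))
print("mean RMSE:", np.mean(np.sqrt(mse)))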
If you want to compute the error on your own or do anything else with models/results, you could do it like this:
for train_ids, test_ids in cv.split(X):
    model.fit(X[train_ids], y[train_ids])
    model.score(X[test_ids], y[test_ids])
    ...
More about this:
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.ShuffleSplit.html
I am running the multi-label classification code given below. How do I fix the NameError saying that "X_train" is not defined?
from scipy.io import arff
data, meta = arff.loadarff('./yeast/yeast-train.arff')
from sklearn.datasets import make_multilabel_classification
# this will generate a random multi-label dataset
X, y = make_multilabel_classification(sparse=True, n_labels=20,
                                      return_indicator='sparse', allow_unlabeled=False)
# using binary relevance
from skmultilearn.problem_transform import BinaryRelevance
from sklearn.naive_bayes import GaussianNB
# initialize binary relevance multi-label classifier
# with a gaussian naive bayes base classifier
classifier = BinaryRelevance(GaussianNB())
# train
classifier.fit(X_train, y_train)
# predict
predictions = classifier.predict(X_test)
from sklearn.metrics import accuracy_score
accuracy_score(y_test,predictions)
You forgot to split the dataset into train and test sets.
Import the function:
from sklearn.model_selection import train_test_split
Add this line before classifier.fit():
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
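Putting it together, the training section would then read like this (a sketch using the variables already defined in the question):
from sklearn.model_selection import train_test_split
# split first, then fit on the training portion only
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
classifier = BinaryRelevance(GaussianNB())
classifier.fit(X_train, y_train)
predictions = classifier.predict(X_test)
accuracy_score(y_test, predictions)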
X_train does not exist; you have to split the data into train and test sets first:
from sklearn.preprocessing import StandardScaler
s = StandardScaler()
X_train = s.fit_transform(X_train)
X_test = s.transform(X_test)  # transform only: the scaler must be fit on the training set alone
Fit the scaler on the training data only and reuse it on the test data, so no information from the test set leaks into preprocessing.
Although I have used StandardScaler or MinMaxScaler to preprocess my data while working with MLPRegressor in sklearn, the predicted values contain a lot of negative numbers, even though the training set contains only positive values. The data is here:
https://drive.google.com/open?id=1JF_EpyiMF5WzKZOt6d0iA2174eAheaTW.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.neural_network import MLPRegressor
x_train, x_test, y_train, y_test = train_test_split(x,y)
min_max_scaler = MinMaxScaler()
x_train = min_max_scaler.fit_transform(x_train)
x_test = min_max_scaler.transform(x_test)
mlp = MLPRegressor(activation='logistic' , solver='sgd' ,verbose=10, hidden_layer_sizes=(10,10), max_iter=1000)
mlp.fit(x_train, y_train)
print("Training set score :%f" % mlp.score(x_train, y_train))
print("Test score :%f" % mlp.score(x_test, y_test))
predictions = mlp.predict(x_test)
Any suggestions as to where the problem is?
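One thing worth checking (an assumption, since the data itself is not shown here): MLPRegressor's output layer is linear, so nothing constrains its predictions to be positive. If y spans a large range while only x is scaled, scaling the target as well often helps. A minimal sketch, reusing the mlp, x_train, x_test, and y_train variables above:
import numpy as np
from sklearn.preprocessing import MinMaxScaler
# scale the target into [0, 1] so it matches the bounded range of the scaled inputs
y_scaler = MinMaxScaler()
y_train_s = y_scaler.fit_transform(np.asarray(y_train).reshape(-1, 1)).ravel()
mlp.fit(x_train, y_train_s)
# map the predictions back to the original units
predictions = y_scaler.inverse_transform(mlp.predict(x_test).reshape(-1, 1)).ravel()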