from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(image_data, labels, test_size = 0.2, random_state = 101)
This raises the error:
ValueError: With n_samples=0, test_size=0.2 and train_size=None, the
resulting train set will be empty. Adjust any of the aforementioned
parameters.
n_samples=0 means that your dataset is empty. Check that the image_data variable was actually populated before the split, for example by printing len(image_data).
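A quick way to catch this before the split is to check the sample count explicitly. The sketch below is only illustrative: require_samples is a hypothetical helper, and the two arrays are stand-ins for whatever the image-loading step produces.

```python
import numpy as np


def require_samples(samples, name="image_data"):
    """Return the number of samples, raising a clear error if there are none."""
    n_samples = len(samples)
    if n_samples == 0:
        raise ValueError(f"{name} is empty; check the step that loads the data.")
    return n_samples


# Hypothetical stand-ins: an empty array (as in the question) and a populated one.
empty_images = np.empty((0, 64))
loaded_images = np.zeros((10, 64))

print(require_samples(loaded_images))  # number of loaded samples

try:
    require_samples(empty_images)
except ValueError as exc:
    print(exc)  # explains that image_data is empty
```

Calling such a check right before train_test_split points at the loading step rather than at the split parameters.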
Here is my code, and it always returns 100% accuracy, regardless of how big the test size is. I used the train_test_split method, so I do not believe there should be any duplicates of data. Could someone inspect my code?
from sklearn.tree import DecisionTreeClassifier
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
data = pd.read_csv('housing.csv')
prices = data['median_house_value']
features = data.drop(['median_house_value', 'ocean_proximity'], axis = 1)
prices.shape
(20640,)
features.shape
(20640, 8)
X_train, X_test, y_train, y_test = train_test_split(features, prices, test_size=0.2, random_state=42)
X_train = X_train.dropna()
y_train = y_train.dropna()
X_test = X_test.dropna()
y_test = X_test.dropna()
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
y_train.shape
(16512,)
X_train.shape
(16512, 8)
predictions = model.predict(X_test)
score = model.score(y_test, predictions)
score
EDIT: I have reworked my answer since I found multiple issues. Please copy-paste the below code to ensure no bugs are left.
Issues -
You are using DecisionTreeClassifier instead of DecisionTreeRegressor for a regression problem.
You are removing NaNs after the train/test split, and separately for the features and the targets, so the rows of X and y can fall out of alignment (note also that y_test is assigned X_test.dropna(), not y_test.dropna()). Call data.dropna() once, before the split.
You are calling model.score(y_test, predictions), but the signature is model.score(X_test, y_test). For a regressor it returns the R² score, not accuracy. accuracy_score only applies to classification, so use model.score(X_test, y_test) or r2_score(y_test, predictions).
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeRegressor #<---- FIRST ISSUE
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
data = pd.read_csv('housing.csv')
data = data.dropna() #<--- SECOND ISSUE
prices = data['median_house_value']
features = data.drop(['median_house_value', 'ocean_proximity'], axis = 1)
X_train, X_test, y_train, y_test = train_test_split(features, prices, test_size=0.2, random_state=42)
model = DecisionTreeRegressor()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
score = r2_score(y_test, predictions) #<----- THIRD ISSUE (same value as model.score(X_test, y_test))
score
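To see why the NaN removal belongs before the split, here is a small sketch on a made-up frame. The column values are hypothetical and only loosely modelled on the housing data in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one feature value is missing, no target is missing.
data = pd.DataFrame({
    "total_rooms": [3.0, np.nan, 5.0, 4.0],
    "median_house_value": [100.0, 200.0, 300.0, 400.0],
})

# Dropping NaNs separately leaves features and targets with different lengths.
features_separate = data[["total_rooms"]].dropna()
prices_separate = data["median_house_value"].dropna()
print(len(features_separate), len(prices_separate))

# Dropping NaNs on the whole frame first keeps every row paired with its target.
clean = data.dropna()
features = clean.drop("median_house_value", axis=1)
prices = clean["median_house_value"]
print(len(features), len(prices))
```

In the second variant the features and targets share the same index, so a later train_test_split keeps them paired.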
I am trying to fit a logistic regression model but get the error below:
ValueError: bad input shape (330, 5)
from sklearn.model_selection import train_test_split
X = ad_data[['Daily Time Spent on Site','Age','Area Income','Daily Internet Usage','Male']]
y= ad_data['Clicked on Ad']
X_train, y_train, X_test, y_test = train_test_split(X,y,test_size=0.33,random_state=42)
logmodel = LogisticRegression()
logmodel.fit(X_train,y_train)
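The error message is not very informative, but the cause is the unpacking order. train_test_split returns X_train, X_test, y_train, y_test, so in your code y_train actually holds X_test, which has shape (330, 5). Assign the result this way:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.33,random_state=42)
Refer to: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html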
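The return order can be checked on a small synthetic array. This is a minimal sketch with made-up data (10 rows, 5 columns), not the ad_data from the question.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 samples with 5 features each, and one label per sample.
X = np.arange(50).reshape(10, 5)
y = np.arange(10)

# For two input arrays the result is: X_train, X_test, y_train, y_test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)  # 2-D feature blocks
print(y_train.shape, y_test.shape)  # 1-D label vectors
```

With the swapped unpacking from the question, the name y_train would be bound to the second returned item, the 2-D X_test block, which matches the shape reported in the error.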
I am using KNN and wanted to experiment with different normalizers (Normalizer(), MinMaxScaler(), StandardScaler() etc).
I have loaded the data into a variable called X:
X = pd.read_csv('C:/Users/rmahesh/documents/parkinson.csv')
After doing some data wrangling, I try and run this code:
from sklearn import preprocessing
from sklearn.decomposition import PCA
T = preprocessing.Normalizer().fit(X)
from sklearn.cross_validation import train_test_split
T_train, T_test, y_train, y_test = train_test_split(T, y, test_size = 0.3, random_state = 7)
from sklearn.svm import SVC
model = SVC()
model = model.fit(T_train, y_train)
score = model.score(T_test, y_test)
print(score)
The specific error code I am getting is this:
TypeError: Singleton array array(Normalizer(copy=True, norm='l2'), dtype=object) cannot be considered a valid collection.
The code in which the error is appearing is this line:
T_train, T_test, y_train, y_test = train_test_split(T, y,
test_size = 0.3, random_state = 7)
Any help would be greatly appreciated!
You're fitting your normalizer and then treating it as an array directly. Replace
T = preprocessing.Normalizer().fit(X)
With
T = preprocessing.Normalizer().fit_transform(X)
So that the actual output of the normalization is used instead. .fit() returns the Normalizer object itself.
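The difference between the two calls is easy to see on a tiny array. This is a sketch with made-up values, using the default L2 norm.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Made-up data: two rows that are scalar multiples of each other.
X = np.array([[3.0, 4.0], [6.0, 8.0]])

fitted = Normalizer().fit(X)        # returns the Normalizer object itself
T = Normalizer().fit_transform(X)   # returns the normalized array

print(type(fitted).__name__)  # Normalizer
print(T)                      # each row scaled to unit L2 norm
```

Only T is an array of samples that train_test_split can index.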
Although I preprocess my data with StandardScaler or MinMaxScaler before fitting an MLPRegressor in sklearn, many of the predicted values are negative even though every target value in the training set is positive. The data is here:
https://drive.google.com/open?id=1JF_EpyiMF5WzKZOt6d0iA2174eAheaTW.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.neural_network import MLPRegressor
x_train, x_test, y_train, y_test = train_test_split(x,y)
min_max_scaler = MinMaxScaler()
x_train = min_max_scaler.fit_transform(x_train)
x_test = min_max_scaler.transform(x_test)
mlp = MLPRegressor(activation='logistic' , solver='sgd' ,verbose=10, hidden_layer_sizes=(10,10), max_iter=1000)
mlp.fit(x_train, y_train)
print("Training set score :%f" % mlp.score(x_train, y_train))
print("Test score :%f" % mlp.score(x_test, y_test))
predictions = mlp.predict(x_test)
Any suggestions as to where the problem is?
I am using the following code to check SGDClassifier:
import numpy as np
from sklearn.datasets import load_boston
from sklearn.linear_model import SGDClassifier
from sklearn.cross_validation import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.cross_validation import train_test_split
data = load_boston()
x_train, x_test, y_train, y_test = train_test_split(data.data, data.target)
x_scalar = StandardScaler()
y_scalar = StandardScaler()
x_train = x_scalar.fit_transform(x_train)
y_train = y_scalar.fit_transform(y_train)
x_test = x_scalar.transform(x_test)
y_test = y_scalar.transform(y_test)
regressor = SGDClassifier(loss='squared_loss')
scores = cross_val_score(regressor, x_train, y_train, cv=5)
print 'cross validation r scores ', scores
print 'average score ', np.mean(scores)
regressor.fit_transform(x_train, y_train)
print 'test set r score ', regressor.score(x_test,y_test)
However, when I run it I get deprecation warnings about reshaping, followed by this ValueError:
ValueError Traceback (most recent call last)
<ipython-input-55-4d64d112f5db> in <module>()
18
19 regressor = SGDClassifier(loss='squared_loss')
---> 20 scores = cross_val_score(regressor, x_train, y_train, cv=5)
ValueError: Unknown label type: (array([ -1.89568750e+00, -1.75715217e+00, -1.68255622e+00,
-1.66124309e+00, -1.62927339e+00, -1.54402088e+00,
-1.49073806e+00, -1.41614211e+00, -1.40548554e+00,
-1.34154616e+00, -1.32023303e+00, -1.30957647e+00,
-1.27760677e+00, -1.26695021e+00, -1.25629365e+00,
-1.20301082e+00, -1.17104113e+00, -1.16038457e+00,....]),)
What is the likely error in the code?
In classification tasks, the dependent variable (or the target) is categorical. We try to predict if a claim is fraudulent or not, for example. In regression, on the other hand, the dependent variable is numerical. It can be measured.
In the Boston Housing dataset, the dependent variable is "Median value of owner-occupied homes in $1000's" (You can see the description by executing print(data.DESCR)). It is a continuous variable and cannot be predicted with a classifier.
If you want to test the classifier, you can use another dataset. For example, change load_boston() to load_iris(). Note that you also need to remove the transformation for the target variable - it is for numerical variables. With these modifications, it should work correctly.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import SGDClassifier
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler
data = load_iris()
x_train, x_test, y_train, y_test = train_test_split(data.data, data.target)
x_scalar = StandardScaler()
x_train = x_scalar.fit_transform(x_train)
x_test = x_scalar.transform(x_test)
classifier = SGDClassifier(loss='squared_loss')  # named 'squared_error' in scikit-learn >= 1.0
scores = cross_val_score(classifier, x_train, y_train, cv=5)
scores
Out: array([ 0.33333333, 0.2173913 , 0.31818182, 0. , 0.19047619])
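To keep the regression target instead of switching datasets, one option is a regressor such as SGDRegressor. The sketch below uses made-up arrays rather than the Boston data, and relies on scikit-learn's type_of_target helper to show how the library classifies each kind of target.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.utils.multiclass import type_of_target

# Hypothetical targets: continuous prices versus integer class labels.
prices = np.array([24.0, 21.6, 34.7, 33.4, 36.2, 28.7])
classes = np.array([0, 1, 2, 1, 0, 2])

print(type_of_target(prices))   # continuous
print(type_of_target(classes))  # multiclass

# Synthetic regression problem: a noise-free linear target on three features.
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 3.0

# A regressor accepts the continuous target that a classifier rejects.
X_scaled = StandardScaler().fit_transform(X)
regressor = SGDRegressor(random_state=0)
regressor.fit(X_scaled, y)
predictions = regressor.predict(X_scaled)
print(predictions[:3])
```

A classifier raises "Unknown label type" for a target of the first kind, which is what the traceback in the question shows.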