sklearn error - ValueError: bad input shape (330, 5) - python

I'm trying to fit a logistic regression model but get the error below:
ValueError: bad input shape (330, 5)
from sklearn.model_selection import train_test_split
X = ad_data[['Daily Time Spent on Site','Age','Area Income','Daily Internet Usage','Male']]
y= ad_data['Clicked on Ad']
X_train, y_train, X_test, y_test = train_test_split(X,y,test_size=0.33,random_state=42)
from sklearn.linear_model import LogisticRegression
logmodel = LogisticRegression()
logmodel.fit(X_train,y_train)

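The error message is not very informative, but the cause is the order of the unpacked variables. train_test_split returns both parts of X first and then both parts of y, so your y_train actually holds X_test. Unpack the result this way: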
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.33,random_state=42)
See: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
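To make the return order concrete, here is a minimal sketch on synthetic data (the array contents and the 1000-row size are made up for illustration, not taken from the ad dataset). With two input arrays, train_test_split returns X_train, X_test, y_train, y_test, in that order.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the ad data: 1000 rows, 5 features (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)

# Correct order: both pieces of X first, then both pieces of y.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

# X_* are 2-D feature matrices, y_* are 1-D label vectors.
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
```

If the variables are unpacked as X_train, y_train, X_test, y_test instead, the name y_train ends up bound to the 2-D test features, which is the kind of array the "bad input shape" message complains about.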

Related

Splitting .npy data for a learning process using fast_ml.model_development

I'm trying to split my data into training, validation, and test sets using fast_ml for a machine learning task. Both my input and output data are read from .npy files through np.load. The input "P" is an array of shape (100000, 4, 4, 6, 1) and the target "Q" is a vector of shape (100000,). I use the code below:
from fast_ml.model_development import train_valid_test_split
X_train, y_train, X_valid, y_valid, X_test, y_test = train_valid_test_split(
    P, Q,
    train_size=0.8,
    valid_size=0.1,
    test_size=0.1)
However, I receive this error:
AttributeError: 'numpy.ndarray' object has no attribute 'drop'
This solved my problem:
from sklearn.model_selection import train_test_split
X_train, X_rem, y_train, y_rem = train_test_split(P,Q, train_size=0.8)
X_valid, X_test, y_valid, y_test = train_test_split(X_rem,y_rem, test_size=0.5)
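The AttributeError suggests that train_valid_test_split calls the DataFrame method drop on its input, so it seems to expect a pandas DataFrame rather than a NumPy array. scikit-learn's train_test_split splits arrays along the first axis, so the two-step approach works on multi-dimensional arrays. A small sketch, assuming stand-in arrays with 1000 samples instead of 100000 and an arbitrary random_state:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Small stand-ins for P and Q (1000 samples instead of 100000; illustrative only).
P = np.zeros((1000, 4, 4, 6, 1))
Q = np.arange(1000)

# Step 1: keep 80% for training and set aside the remaining 20%.
X_train, X_rem, y_train, y_rem = train_test_split(
    P, Q, train_size=0.8, random_state=0
)
# Step 2: split the remainder in half, giving 10% validation and 10% test overall.
X_valid, X_test, y_valid, y_test = train_test_split(
    X_rem, y_rem, test_size=0.5, random_state=0
)

print(X_train.shape, X_valid.shape, X_test.shape)
```

The second call uses test_size=0.5 because it operates on the 20% remainder, not on the full dataset.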

Error using svc.fit(): ValueError: bad input shape

I have data about Parkinson's patients stored in the DataFrame X, and y indicates whether a patient has Parkinson's (0 or 1). The data is loaded with:
X=pd.read_csv('parkinsons.data',index_col=0)
y=X['status']
X=X.drop(['status'],axis=1)
Then, I create training and test samples:
X_train, y_train, X_test, y_test = train_test_split(X,y,test_size=0.3,random_state=7)
I want to use SVC on this training data:
svc=SVC()
svc.fit(X_train,y_train)
Then, I get the error:
ValueError: bad input shape (59, 22).
What did I do wrong and how can I get rid of this error?
The problem is in how you unpack the result of train_test_split. Careful: train_test_split returns the X parts first, followed by the y parts. You are actually giving the name y_train to what is X_test. Change this and it should work:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=7)
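Either keep your unpacking order and pass the third returned value as the labels (this works, but the names are misleading, because the variable called X_test actually holds the training labels):
X_train, y_train, X_test, y_test = train_test_split(X,y,test_size=0.3,random_state=7)
svc=SVC()
svc.fit(X_train,X_test)
Or unpack in the order that train_test_split actually returns:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=7)
svc=SVC()
svc.fit(X_train,y_train)
I prefer the second option.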

I am stuck with train_test_split of sklearn model_selection

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(image_data, labels, test_size = 0.2, random_state = 101)
This shows the error:
ValueError: With n_samples=0, test_size=0.2 and train_size=None, the
resulting train set will be empty. Adjust any of the aforementioned
parameters.
n_samples=0 means that your dataset is empty. Check how the image_data variable is loaded.
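One way to catch this earlier is to validate the arrays before splitting. The helper below is a hypothetical sketch (split_checked is not part of scikit-learn, and the error messages are my own wording). It only wraps train_test_split with two length checks.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def split_checked(data, labels, test_size=0.2, random_state=101):
    """Hypothetical helper: fail early with a clearer message if nothing was loaded."""
    data = np.asarray(data)
    labels = np.asarray(labels)
    if len(data) == 0:
        # Typical causes: wrong directory, wrong file extension, failed image reads.
        raise ValueError("image_data is empty; check the loading step")
    if len(data) != len(labels):
        raise ValueError(f"got {len(data)} samples but {len(labels)} labels")
    return train_test_split(
        data, labels, test_size=test_size, random_state=random_state
    )
```

It would be called as x_train, x_test, y_train, y_test = split_checked(image_data, labels).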

sklearn pipeline raises ValueError after imputer, but running each component does not raise error

I was trying to set up a pipeline with the following code. temp_start and temp are both pandas DataFrames.
imp = SimpleImputer(missing_values=np.nan,strategy='mean')
linreg = LinearRegression()
steps = [('imputation', imp),('linear_regression', linreg)]
pipeline = Pipeline(steps)
X_train, X_test, y_train, y_test = train_test_split(temp_start, temp,test_size=0.3, random_state=42)
pipeline.fit(X_train, y_train)
This raises ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
However, when I run each component of this pipeline, it runs perfectly. I used the code below.
X_train, X_test, y_train, y_test = train_test_split(temp_start, temp, test_size=0.3, random_state=42)
X_train = imp.fit_transform(X_train)
y_train = imp.fit_transform(y_train)
X_test = imp.fit_transform(X_test)
y_test = imp.fit_transform(y_test)
model = LinearRegression().fit(X_train,y_train)
model.score(X_test, y_test)
I can't figure out why one runs and the other does not.
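A likely explanation, assuming the target temp also contains missing values: a Pipeline applies its transformers only to X, never to y. The imputer therefore fills the features, while the NaNs in y_train still reach LinearRegression and trigger the error. The step-by-step version runs because it imputes y_train and y_test explicitly. A minimal sketch on made-up data (the column names and values are hypothetical) that drops rows with a missing target before fitting:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Made-up data with NaNs in both the features and the target.
X = pd.DataFrame({"a": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
                  "b": [2.0, np.nan, 6.0, 8.0, 10.0, 12.0]})
y = pd.Series([3.0, 6.0, 9.0, np.nan, 15.0, 18.0])

pipeline = Pipeline([
    ("imputation", SimpleImputer(missing_values=np.nan, strategy="mean")),
    ("linear_regression", LinearRegression()),
])

# The pipeline only imputes X, so remove rows whose target is missing first.
mask = y.notna()
pipeline.fit(X[mask], y[mask])
print(pipeline.predict(X[mask]).shape)
```

Dropping rows with an unknown target is usually safer than imputing the target with its mean. Note also that calling fit_transform on X_test refits the imputer on the test data, whereas a pipeline fits the imputer on the training data only and reuses those statistics at prediction time.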

How to fix "ValueError: Found input variables with inconsistent numbers of samples: [10000, 60000]"?

I have a problem training a model with stochastic gradient descent on the MNIST database.
from sklearn.datasets import fetch_mldata
from sklearn.linear_model import SGDClassifier
mnist = fetch_mldata('MNIST original')
X, y = mnist["data"], mnist["target"]
some_digit = X[36000]
some_digit_image = some_digit.reshape(28, 28)
X_train, X_train, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
y_train_5 = (y_train == 5)
y_test_5 = (y_test == 5)
sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train_5)
The error occurs at the end of the process (I think the last line of code is the problem):
ValueError: Found input variables with inconsistent numbers of samples: [10000, 60000]
It's a typo on your side: you are assigning to X_train twice, so X_train ends up holding the 10000 test samples while y_train has 60000 labels:
X_train, X_train, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
The correct line would be:
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
By the way, fetch_mldata was deprecated in scikit-learn 0.20 and later removed, so it is better to use fetch_openml:
from sklearn.datasets import fetch_openml
X, y = fetch_openml("mnist_784", version=1, return_X_y=True)
I would also suggest a stratified split between the train and test sets, because otherwise some classes might have a skewed representation in the training set. Pass stratify=y to get one:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42, stratify=y)
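To illustrate what stratification does, here is a sketch with made-up imbalanced labels (90 samples of one class, 10 of another; not MNIST). Passing stratify=y to train_test_split keeps the class proportions approximately equal in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 90 samples of class 0, 10 of class 1.
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Per-class sample counts in each split.
print(np.bincount(y_train), np.bincount(y_test))
```

Without stratify, the number of class-1 samples in the test set depends on the shuffle and can drift away from the 10% share, which matters more the rarer the class is.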
