Predicting a single value with scikit-learn leads to ValueError - python

I'm trying to do some basic sklearn stuff with a single X variable and a single y variable. Since I predict with a single column, I have to transform X into a 2D array. Now I want to predict a single value, but my model only lets me predict an array of length 32.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
import numpy as np
df = pd.read_csv("https://gist.githubusercontent.com/seankross/a412dfbd88b3db70b74b/raw/5f23f993cd87c283ce766e7ac6b329ee7cc2e1d1/mtcars.csv")
df
X = df["mpg"].values.reshape(1, -1)
y = df["cyl"].values.reshape(1, -1)
y
clf = RandomForestClassifier(random_state=0)
clf.fit(X, y)
clf.predict([[35]])
ValueError: Number of features of the model must match the input.
Model n_features is 32 and input n_features is 1
Can anyone help me solve this problem?

You fitted the model with data of the wrong shape. If you do:
X = df["mpg"].values.reshape(1, -1)
y = df["cyl"].values.reshape(1, -1)
X.shape
(1, 32)
This means X is 1 observation with 32 predictors, whereas what you actually have is 1 predictor with 32 observations.
So it should be:
X = df[["mpg"]]
y = df["cyl"]
clf = RandomForestClassifier(random_state=0)
clf.fit(X, y)
Then predict using:
clf.predict(np.array(35).reshape(-1,1))
array([4])
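Equivalently, once the model is fitted on data of shape (32, 1), the original call from the question works as-is, since [[35]] is already one sample with one feature:
clf.predict([[35]])
array([4])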


Trying to train simple linear regression algorithm. Keep getting error [duplicate]

While practicing a simple linear regression model I got this error, and I think there is something wrong with my data set.
(The data set, the independent variable X, the dependent variable Y, X_train, and Y_train were shown as images in the original post.)
This is the error body:
ValueError: Expected 2D array, got 1D array instead:
array=[ 7. 8.4 10.1 6.5 6.9 7.9 5.8 7.4 9.3 10.3 7.3 8.1].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
And this is My code:
import pandas as pd
import matplotlib.pyplot as plt
# Import data set
dataset = pd.read_csv('Sample-data-sets-for-linear-regression1.csv')
x = dataset.iloc[:, 1].values   # 1-D array of shape (n,)
y = dataset.iloc[:, 2].values
# Splitting the dataset into training set and test set
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in newer versions
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
# Linear regression
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(x_train, y_train)   # raises the ValueError above because x_train is 1-D
y_pred = regressor.predict(x_test)
Thank you
You need to give both the fit and predict methods 2D arrays. Your x_train and x_test are currently only 1 dimensional. What is suggested by the console should work:
x_train= x_train.reshape(-1, 1)
x_test = x_test.reshape(-1, 1)
This uses numpy's reshape to transform your array. For example, x = [1, 2, 3] would be transformed to a matrix x' = [[1], [2], [3]] (-1 gives the first dimension of the matrix, inferred from the length of the array and the remaining dimensions; 1 is the second dimension, giving us an n x 1 matrix where n is the input length).
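A quick sanity check in plain NumPy shows the transformation (nothing here is specific to the question's data):
import numpy as np
x = np.array([1, 2, 3])
print(x.shape)          # (3,)  - 1-D, which fit and predict reject
x2 = x.reshape(-1, 1)
print(x2.shape)         # (3, 1) - n samples, 1 feature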
Questions about reshape have been answered in the past; this one, for example, should explain what reshape(-1, 1) fully means: What does -1 mean in numpy reshape? (some of the other answers below explain this very well too)
A lot of the time when doing linear regression problems, people like to envision the familiar two-dimensional plot of data points with a fitted line (shown as an image in the original answer).
On the input side, we have X = [1, 2, 3, 4, 5].
However, many regression problems have multidimensional inputs. Consider the prediction of housing prices. It's not one attribute that determines housing prices; it's multiple features (e.g. number of rooms, location, etc.).
If you look at the documentation for fit, it tells us that rows consist of the samples while the columns consist of the features.
However, consider what happens when we have one feature as our input. Then we need an n x 1 dimensional input, where n is the number of samples and the single column represents our only feature.
Why does the array.reshape(-1, 1) suggestion work? -1 means: choose a number of rows that works based on the number of columns provided (here, 1).
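As a small sketch of the two cases (the feature values here are invented purely for illustration):
import numpy as np
# multi-feature input: 3 houses, 2 features (rooms, area)
X_multi = np.array([[3, 120.0],
                    [4, 150.0],
                    [2,  80.0]])                   # shape (3, 2)
# single-feature input must still be 2-D: 3 houses, 1 feature
X_single = np.array([3, 4, 2]).reshape(-1, 1)      # shape (3, 1)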
If you look at the documentation of LinearRegression in scikit-learn:
fit(X, y, sample_weight=None)
X : numpy array or sparse matrix of shape [n_samples,n_features]
predict(X)
X : {array-like, sparse matrix}, shape = (n_samples, n_features)
As you can see, X has 2 dimensions, whereas your x_train and x_test clearly have one.
As suggested, add:
x_train = x_train.reshape(-1, 1)
x_test = x_test.reshape(-1, 1)
before fitting and predicting with the model.
If x_test is a single scalar value rather than an array, wrap it to get the required 2-D shape:
y_pred = regressor.predict([[x_test]])
I would suggest reshaping x at the beginning, before you split into train and test datasets:
import pandas as pd
import matplotlib.pyplot as plt
# Import data set
dataset = pd.read_csv('Sample-data-sets-for-linear-regression1.csv')
x = dataset.iloc[:, 1].values
y = dataset.iloc[:, 2].values
# Here is the trick: make x 2-D (n samples, 1 feature) once, up front
x = x.reshape(-1, 1)
# Splitting the dataset into training set and test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
# Linear regression
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(x_train, y_train)
y_pred = regressor.predict(x_test)
This is what I use (note the .values, which assumes X_train, y_train, etc. are pandas objects):
X_train = X_train.values.reshape(-1, 1)
y_train = y_train.values.reshape(-1, 1)
X_test = X_test.values.reshape(-1, 1)
y_test = y_test.values.reshape(-1, 1)
This is the solution, assuming x_test is a single scalar value:
regressor.predict([[x_test]])
And for polynomial regression:
regressor_2.predict(poly_reg.fit_transform([[x_test]]))
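A minimal runnable sketch of that pattern with synthetic data (the names poly_reg and regressor_2 follow the answer; nothing here is the asker's actual data). Note that at predict time, transform should be preferred over fit_transform so that the mapping fitted on the training data is reused:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
x = np.arange(10, dtype=float).reshape(-1, 1)   # toy data following y = x^2
y = (x ** 2).ravel()
poly_reg = PolynomialFeatures(degree=2)
regressor_2 = LinearRegression().fit(poly_reg.fit_transform(x), y)
x_test = 5.0
print(regressor_2.predict(poly_reg.transform([[x_test]])))   # approximately [25.]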
Modify
regressor.fit(x_train,y_train)
y_pred = regressor.predict(x_test)
to (note that in the code above, x_train and x_test are already NumPy arrays from .values, so reshape them directly):
regressor.fit(x_train.reshape(-1, 1), y_train)
y_pred = regressor.predict(x_test.reshape(-1, 1))

Dimensions do not match in linear regression

I am trying a simple linear regression model but don't understand why an error like this appears:
Here is my code:
from sklearn import linear_model
regr = linear_model.LinearRegression()
regr.fit(X, Y)
which produces following error:
ValueError: Found input variables with inconsistent numbers of samples: [1518, 15]
The shapes of X and Y are:
X.shape, Y.shape
((1518, 1), (15, 1))
I am trying to predict Y from X, but the dimensions do not match; how can I overcome this problem?
It looks like you split your features and outcome variable the wrong way.
Given what you have written, you have N = 1518 samples and 15 features, one of which is the outcome variable.
If this is the case, your input vector Y and matrix X should have the shapes:
X.shape = (1518,14)
Y.shape = (1518,1)
Assume you are given a pd.DataFrame with feature names F1...F15, and your dependent variable Y is F3. Then you can split your variables as follows:
Y = df['F3']
X = df.drop('F3', axis=1)
Note: if you are currently using a numpy array, you can easily wrap it in a DataFrame using:
import pandas as pd
df = pd.DataFrame(np_array)
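A minimal end-to-end sketch of that split, using random placeholder data (the column names F1...F15 follow the answer; none of these values are the asker's actual data):
import numpy as np
import pandas as pd
np_array = np.random.rand(1518, 15)   # 1518 samples, 15 columns
df = pd.DataFrame(np_array, columns=[f"F{i}" for i in range(1, 16)])
Y = df['F3']
X = df.drop('F3', axis=1)
print(X.shape, Y.shape)               # (1518, 14) (1518,)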

single data to predict via linear regression model with more than one categorical feature

I built a linear regression model to predict the sales numbers for a product.
In my case I have 5 features, 4 of which are categorical:
MONTH REGION INTERVENANT CONFIG WEIGHT SALES_NB
I used OneHotEncoder
from sklearn.preprocessing import OneHotEncoder
onehotencoder = OneHotEncoder(categorical_features = [0,1,2,3])
X = onehotencoder.fit_transform(X).toarray()
X = X[:, 1:]  # drop the first encoded column
(correct me if I am wrong)
I want to know how do I format my data to pass it to predict().
Actually if I pass:
Xnew = np.array([[2,2,14895,614,0.1]])
ynew = regressor.predict(Xnew)
I got this error:
ValueError: shapes (1,4) and (428,) not aligned: 4 (dim 1) != 428 (dim 0)
Try encoding the new sample with onehotencoder before you pass it to the predictor, applying the same steps you used on the training data:
Xnew = np.array([[2, 2, 14895, 614, 0.1]])
Xnew_encoded = onehotencoder.transform(Xnew).toarray()  # same encoding as at fit time
Xnew_encoded = Xnew_encoded[:, 1:]                      # same first-column drop as on the training data
ynew = regressor.predict(Xnew_encoded)
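As a side note, the categorical_features argument has been removed from OneHotEncoder in newer scikit-learn versions. A hedged sketch of the modern equivalent, assuming the same column layout as the question (columns 0-3 categorical, column 4 the numeric WEIGHT):
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
preprocess = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), [0, 1, 2, 3])],
    remainder="passthrough",   # pass the numeric WEIGHT column through unchanged
)
model = Pipeline([("prep", preprocess), ("reg", LinearRegression())])
# model.fit(X_raw, y)                                  # X_raw: the raw, unencoded feature matrix
# model.predict(np.array([[2, 2, 14895, 614, 0.1]]))   # raw samples can be passed directly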

Error: Shapes (1,4) and (14,14) not aligned

So I'm a newbie to Machine Learning and a little baffled by this error:
Shapes (1,4) and (14,14) not aligned: 4 (dim 1) != 14 (dim 0)
Here is the full error:
File "/Users/jim/anaconda3/lib/python3.6/site-packages/sklearn/utils/extmath.py", line 140, in safe_sparse_dot
return np.dot(a, b)
ValueError: shapes (1,4) and (14,14) not aligned: 4 (dim 1) != 14 (dim 0)
My test set has 4 rows of data and training set 14 rows of data, as indicated by (1,4) and (14,14). At least I think that's what that means.
I'm trying to fit a simple linear regression to a Training set as indicated by my code below:
# Fit Simple Linear Regression to Training Set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
X_train = X_train.reshape(1,-1)
y_train = y_train.reshape(1,-1)
regressor.fit(X_train, y_train)
Then predict the Test Set Results:
# Predicting the Test Set Results
X_test = X_test.reshape(1,-1)
y_pred = regressor.predict(X_test)
My code is failing on the last line with the above error:
y_pred = regressor.predict(X_test)
Any hints in the right direction would be great.
Here is my whole code sample:
# Simple Linear Regression
# Import Libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Import dataset
dataset = pd.read_csv('NBA.csv')
X = dataset.iloc[:, 1].values
y = dataset.iloc[:, :-1].values
# Splitting the dataset into Train and Test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
# Feature Scaling
# None
# Fit Simple Linear Regression to Training Set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
X_train = X_train.reshape(1,-1)
y_train = y_train.reshape(1,-1)
regressor.fit(X_train, y_train)
# Predicting the Test Set Results
X_test = X_test.reshape(1,-1)
y_pred = regressor.predict(X_test)
** EDIT **
I checked the shape of X and y. Here is my output below:
dataset = pd.read_csv('NBA.csv')
X = dataset.iloc[:, 1].values
y = dataset.iloc[:, :-1].values
print(X.shape)
print(y.shape)
-->(18,)
-->(18, 1)
Please replace reshape(1, -1) with reshape(-1, 1) for all usages. The former transforms an array into (1 person x n features) and the latter into (n persons x 1 feature); the feature is height, in this case.
If you modify the data-loading section as below, the reshape happens once up front, and the arrays already satisfy the (n persons x 1 feature) form everywhere else:
# Import dataset
dataset = pd.read_csv('NBA.csv')
X = dataset.iloc[:, 1].values
y = dataset.iloc[:, 0].values
X = X.reshape(-1, 1)
y = y.reshape(-1, 1)
In the early days of sklearn you could feed 1-D vectors as inputs, but this has since changed: you now need to explicitly indicate whether a vector is (1 sample x n features) or (n samples x 1 feature) by using reshape or some other method.
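A minimal sketch with synthetic stand-in data (not the actual NBA.csv values) showing why the orientation matters:
import numpy as np
from sklearn.linear_model import LinearRegression
X = np.arange(18, dtype=float)   # 18 values, like the (18,) array above
y = 2 * X + 1
# reshape(1, -1) would mean 1 sample with 18 features; reshape(-1, 1) means 18 samples, 1 feature
reg = LinearRegression().fit(X.reshape(-1, 1), y.reshape(-1, 1))
print(reg.predict(np.array([[5.0]])))   # one sample, one feature -> [[11.]]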

multi-class logistic regression using sklearn (representing y as multi-class)

I'm working on what I thought was a fairly simple machine learning problem.
In this problem, the y (label) I want to classify is multi-class; in this dataset there are 6 possible classes.
I've been using the preprocessing.LabelBinarizer() function to pivot my y set to an array of ones or zeros in hopes that this would be sufficient (e.g. [0 0 0 0 0 1]).
The code below fails on model.fit() with ValueError: Found arrays with inconsistent numbers of samples: [ 217 1302] (1302 is 217*6, by the way):
lb = preprocessing.LabelBinarizer()
api_y = lb.fit_transform(df['gear'])
y = pd.DataFrame(api_y)
y = np.ravel(y)
It seems that the binarizer returns results that look to the algorithm like 6 columns, instead of 1 column containing an array of 6 digits.
I've tried to force it into an array using the code below, but then the fit function bails for another reason: ValueError: Unknown label type: array([array([0,1,0,0,0]), array([0,1,0,0...])
lb = preprocessing.LabelBinarizer()
api_y = lb.fit_transform(df['gear'])
y_list = []
for x in api_y:
    item = {'gear': np.array(x)}
    y_list.append(item)
y = pd.DataFrame(y_list)
print("after changing to binary classes array y is "+repr(y.shape))
y = np.ravel(y)
I also tried the sklearn_pandas.DataFrameMapper to no avail as it also created 6 distinct fields vs. an array of 6 values represented as one field.
Any help or suggestions would be appreciated. The full version of what I thought was right is posted here for clarity:
#!/Library/Frameworks/Python.framework/Versions/3.5/bin/python3
import pandas as pd
import numpy as np
from patsy import dmatrices
from sklearn.linear_model import LogisticRegression
from sklearn import preprocessing
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn import metrics
import sklearn_pandas
#
# load training data taken from 2 years of Strava rides
df = pd.read_csv("gear_train.csv")  # pd.DataFrame.from_csv was removed in newer pandas
#
# Prepare data for logistic regression
#
y, X = dmatrices('gear ~ distance + moving_time + total_elevation_gain + average_speed + max_speed + average_cadence + has_heartrate + device_watts', df, return_type="dataframe")
#
# Fix up y to be a flattened array of 1 column (binary array?)
#
lb = preprocessing.LabelBinarizer()
api_y = lb.fit_transform(df['gear'])
y = pd.DataFrame(api_y)
y = np.ravel(y)
#
# run the logistic regression
#
model = LogisticRegression()
model = model.fit(X, y)
score = model.score(X, y)
#
# evaluate the model by splitting into training and testing data sets
#
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model2 = LogisticRegression()
model2.fit(X_train, y_train)
predicted = model2.predict(X_test)
print("predicted="+repr(lb2.inverse_transform(predicted)))
print(metrics.classification_report(y_test, predicted))
#
# do a 10-fold CV test to see if this model holds up
#
scores = cross_val_score(LogisticRegression(), X, y, scoring='accuracy', cv=10)
print(scores.mean())
The root cause of my problem was that the y field contained string values instead of numeric ones, for example b12345 as a key instead of 12345. Once I switched to LabelEncoder for encoding and decoding, it worked like a champ.
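A minimal sketch of that fix with toy stand-in data (the column name 'gear' comes from the question; the label values and the single feature are invented for illustration):
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
df = pd.DataFrame({'gear': ['b12345', 'b67890', 'b12345', 'b67890']})
X = [[0.1], [0.9], [0.2], [0.8]]
le = LabelEncoder()
y = le.fit_transform(df['gear'])               # string labels -> integers 0..k-1
model = LogisticRegression().fit(X, y)
print(le.inverse_transform(model.predict(X)))  # decode predictions back to strings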
