Dimensions do not match in linear regression - python

I am trying to fit a simple linear regression model, but I don't understand why the error below appears.
Here is my code:
from sklearn import linear_model
regr = linear_model.LinearRegression()
regr.fit(X, Y)
which produces the following error:
ValueError: Found input variables with inconsistent numbers of samples: [1518, 15]
The shapes of X and Y are:
X.shape, Y.shape
((1518, 1), (15, 1))
I am trying to predict these Y out of X but my dimensions are not the same; how can I overcome this problem?

It looks like you split your features and outcome variable the wrong way.
Based on what you have written, you have N=1518 samples and 15 variables, one of which is the outcome variable.
If this is the case, your vector Y and matrix X should have the shapes:
X.shape = (1518,14)
Y.shape = (1518,1)
Assume you are given a pd.DataFrame with feature names F1...F15 and your dependent variable Y is F3; then you can split your variables as follows:
Y = df['F3']
X = df.drop('F3', axis=1)
Note: if you are currently using a numpy array, you can easily wrap it in a dataframe using:
import pandas as pd
df = pd.DataFrame(np_array)
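To check a split like this before fitting, it can help to print the shapes. Here is a minimal sketch on synthetic data; the column names F1..F15, the choice of F3 as the target, and the random values are assumptions for illustration only, not the asker's data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data: 1518 rows, 15 columns named F1..F15 (random values).
rng = np.random.default_rng(0)
columns = [f"F{i}" for i in range(1, 16)]
df = pd.DataFrame(rng.normal(size=(1518, 15)), columns=columns)

# F3 is the target; every other column is a feature.
Y = df["F3"]
X = df.drop("F3", axis=1)

# X and Y must agree on the first dimension (the number of rows).
print(X.shape, Y.shape)

regr = LinearRegression()
regr.fit(X, Y)
```

If the first entries of the two shapes differ, fit raises the "inconsistent numbers of samples" error from the question.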

Related

Trying to train simple linear regression algorithm. Keep getting error [duplicate]

While practicing a simple linear regression model I got this error.
I think there is something wrong with my data set.
[Screenshots in the original post: the data set, the independent variable X, the dependent variable Y, X_train and Y_train]
This is the error:
ValueError: Expected 2D array, got 1D array instead:
array=[ 7. 8.4 10.1 6.5 6.9 7.9 5.8 7.4 9.3 10.3 7.3 8.1].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
And this is my code:
import pandas as pd
import matplotlib as pt
#import data set
dataset = pd.read_csv('Sample-data-sets-for-linear-regression1.csv')
x = dataset.iloc[:, 1].values
y = dataset.iloc[:, 2].values
#Spliting the dataset into Training set and Test Set
from sklearn.cross_validation import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size= 0.2, random_state=0)
#linnear Regression
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(x_train,y_train)
y_pred = regressor.predict(x_test)
Thank you
You need to give both the fit and predict methods 2D arrays. Your x_train and x_test are currently only 1 dimensional. What is suggested by the console should work:
x_train= x_train.reshape(-1, 1)
x_test = x_test.reshape(-1, 1)
This uses numpy's reshape to transform your array. For example, x = [1, 2, 3] would be transformed to a matrix x' = [[1], [2], [3]]. The -1 tells numpy to infer the number of rows from the length of the array and the remaining dimensions, and the 1 is the number of columns, giving us an n x 1 matrix where n is the input length.
Questions about reshape have been answered in the past. This one, for example, should explain in full what reshape(-1, 1) means: What does -1 mean in numpy reshape? (Some of the other answers below explain this very well too.)
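A small sketch of the two reshape calls mentioned in the error message, using a made-up three-element array, so the resulting shapes can be compared directly:

```python
import numpy as np

x = np.array([1, 2, 3])   # 1D array, shape (3,)

col = x.reshape(-1, 1)    # one column, rows inferred: 3 samples, 1 feature
row = x.reshape(1, -1)    # one row, columns inferred: 1 sample, 3 features

print(x.shape, col.shape, row.shape)
```

For a single-feature regression, the column form is the one that matches what fit and predict expect.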
A lot of times when doing linear regression problems, people like to envision this graph [image in the original post].
On the input, we have X = [1,2,3,4,5].
However, many regression problems have multidimensional inputs. Consider the prediction of housing prices. It's not one attribute that determines housing prices; it's multiple features (e.g. number of rooms, location, etc.).
If you look at the documentation you will see this [image in the original post].
It tells us that the rows are the samples while the columns are the features.
However, consider what happens when we have one feature as our input. Then we need an n x 1 dimensional input, where n is the number of samples and the single column represents our only feature.
Why does the array.reshape(-1, 1) suggestion work? -1 means: choose the number of rows that fits, given the number of columns provided (here 1). [An image in the original post showed how the input changes.]
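As a rough illustration of the single-feature case, here is a sketch that fits a model on X = [1, 2, 3, 4, 5]. The target y = 2x + 1 is made up so that the fitted line is easy to check by hand.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([1, 2, 3, 4, 5])   # 1D: shape (5,)
y = 2 * X + 1                   # made-up target values

# Rows are samples, columns are features: 5 samples, 1 feature.
X_2d = X.reshape(-1, 1)

model = LinearRegression().fit(X_2d, y)

# New inputs must be 2D as well: one sample with one feature.
pred = model.predict(np.array([[6]]))
print(X_2d.shape, pred)
```

Passing X directly instead of X_2d would raise the "Expected 2D array, got 1D array instead" error.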
If you look at the documentation of LinearRegression in scikit-learn, you will see:
fit(X, y, sample_weight=None)
X : numpy array or sparse matrix of shape [n_samples,n_features]
predict(X)
X : {array-like, sparse matrix}, shape = (n_samples, n_features)
As you can see, X has 2 dimensions, whereas your x_train and x_test clearly have one.
As suggested, add the following before fitting and predicting:
x_train = x_train.reshape(-1, 1)
x_test = x_test.reshape(-1, 1)
If x_test is a single value rather than an array, use
y_pred = regressor.predict([[x_test]])
I would suggest reshaping X at the beginning, before you split into train and test datasets:
import pandas as pd
import matplotlib as pt
# Import the data set
dataset = pd.read_csv('Sample-data-sets-for-linear-regression1.csv')
x = dataset.iloc[:, 1].values
y = dataset.iloc[:, 2].values
# Here is the trick: make x a 2D array of shape (n_samples, 1)
x = x.reshape(-1, 1)
# Splitting the dataset into training set and test set
# (sklearn.cross_validation was removed in newer scikit-learn versions; use model_selection)
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
# Linear regression
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(x_train, y_train)
y_pred = regressor.predict(x_test)
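An alternative that avoids the reshape altogether is to select the column with a list, which keeps the result 2D. This is only a sketch: the DataFrame below and its column names are made up and stand in for the CSV file.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Made-up data standing in for the CSV; column names are hypothetical.
dataset = pd.DataFrame({
    "id": range(10),
    "hours": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    "score": [12.0, 14.0, 16.0, 18.0, 20.0, 22.0, 24.0, 26.0, 28.0, 30.0],
})

x_1d = dataset.iloc[:, 1].values     # integer index: shape (10,)
x_2d = dataset.iloc[:, [1]].values   # list index: shape (10, 1)
y = dataset.iloc[:, 2].values

x_train, x_test, y_train, y_test = train_test_split(
    x_2d, y, test_size=0.2, random_state=0
)

regressor = LinearRegression().fit(x_train, y_train)
y_pred = regressor.predict(x_test)
print(x_1d.shape, x_2d.shape, y_pred.shape)
```

The slice form dataset.iloc[:, 1:2] gives the same 2D result as the list form.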
This is what I use (X_train, y_train, X_test and y_test are pandas objects here):
X_train = X_train.values.reshape(-1, 1)
y_train = y_train.values.reshape(-1, 1)
X_test = X_test.values.reshape(-1, 1)
y_test = y_test.values.reshape(-1, 1)
This is the solution when x_test is a single value:
regressor.predict([[x_test]])
And for polynomial regression:
regressor_2.predict(poly_reg.fit_transform([[x_test]]))
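One caveat worth checking: wrapping a value in double brackets only produces a 2D input when that value is a scalar. A quick sketch with made-up numbers shows the difference:

```python
import numpy as np

scalar = 6.5
array_1d = np.array([7.0, 8.4, 10.1])

wrapped_scalar = np.asarray([[scalar]])    # shape (1, 1): one sample, one feature
wrapped_array = np.asarray([[array_1d]])   # shape (1, 1, 3): three dimensions
reshaped_array = array_1d.reshape(-1, 1)   # shape (3, 1): three samples, one feature

print(wrapped_scalar.shape, wrapped_array.shape, reshaped_array.shape)
```

So for a whole test array, reshape(-1, 1) is the safer choice; the double brackets fit the single-value case.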
Modify
regressor.fit(x_train,y_train)
y_pred = regressor.predict(x_test)
to (x_train and x_test are already numpy arrays in your code, so .values is not needed)
regressor.fit(x_train.reshape(-1,1),y_train)
y_pred = regressor.predict(x_test.reshape(-1,1))

Predict a single value with scikit-learn leads to ValueError

I am trying to do some basic sklearn stuff with a single X variable and a single Y variable. Since I predict with a single column, I have to transform X into a 2D array. Now I want to predict a single value, but my model only allows me to predict an array of length 32.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
import numpy as np
df = pd.read_csv("https://gist.githubusercontent.com/seankross/a412dfbd88b3db70b74b/raw/5f23f993cd87c283ce766e7ac6b329ee7cc2e1d1/mtcars.csv")
df
X = df["mpg"].values.reshape(1, -1)
y = df["cyl"].values.reshape(1, -1)
y
clf = RandomForestClassifier(random_state=0)
clf.fit(X, y)
clf.predict([[35]])
ValueError: Number of features of the model must match the input.
Model n_features is 32 and input n_features is 1
Can anyone help me to solve this problem?
You fitted the model with data of the wrong shape. If you do:
X = df["mpg"].values.reshape(1, -1)
y = df["cyl"].values.reshape(1, -1)
X.shape
(1, 32)
This means X is 1 observation with 32 predictors, whereas what you have is 1 predictor and 32 observations.
So it should be:
X = df[["mpg"]]
y = df["cyl"]
clf = RandomForestClassifier(random_state=0)
clf.fit(X, y)
Then predict using:
clf.predict(np.array(35).reshape(-1,1))
array([4])
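Here is a self-contained sketch of the same idea that does not need the CSV download. The mpg and cyl values are made up for illustration, and passing a one-row DataFrame to predict is just one possible way to keep the input 2D.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Made-up values standing in for the mpg and cyl columns.
df = pd.DataFrame({
    "mpg": [33.9, 30.4, 27.3, 21.4, 19.7, 18.1, 15.2, 13.3, 10.4],
    "cyl": [4, 4, 4, 6, 6, 6, 8, 8, 8],
})

wrong = df["mpg"].values.reshape(1, -1)   # shape (1, 9): 1 sample, 9 features
X = df[["mpg"]]                           # shape (9, 1): 9 samples, 1 feature
y = df["cyl"]

clf = RandomForestClassifier(random_state=0).fit(X, y)

# A one-row DataFrame with the same column name is a 2D input of shape (1, 1).
new_car = pd.DataFrame({"mpg": [35.0]})
prediction = clf.predict(new_car)
print(wrong.shape, X.shape, prediction)
```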

Expected 2D array, got 1D array instead. Where's the mistake?

I am beginning to learn SVM and PCA. I tried to apply SVM on the scikit-learn 'load_digits' dataset.
When I apply the .fit method to SVC, I get an error:
"Expected 2D array, got 1D array instead:
array=[ 1.9142151 0.58897807 1.30203491 ... 1.02259477 1.07605691
-1.25769703].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature
or array.reshape(1, -1) if it contains a single sample."
Here is the code I wrote:
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
X_digits, y_digits = load_digits(return_X_y=True)
data = scale(X_digits)
pca=PCA(n_components=10).fit_transform(data)
reduced_data = PCA(n_components=2).fit_transform(data)
from sklearn.svm import SVC
clf = SVC(kernel='rbf', C=1E6)
X=reduced_data[:,0]
y=reduced_data[:,1]
clf.fit(X, y)
Can someone help me out? Thank you in advance.
Your error results from the fact that clf.fit() requires the array X to be 2-dimensional (currently it is 1-dimensional). Using X.reshape(-1, 1) turns X from an (N,) array (1D) into an (N, 1) array (2D, as we would like), where N is the number of samples in the dataset. However, I also believe that your interpretation of reduced_data may be incorrect (from my limited experience of sklearn):
The reduced_data array that you have contains two principal components (the two directions of greatest variance in the dataset, n_components=2), which you should be using as the new "data" (X).
Instead, you have taken the first column of reduced_data to be the samples X, and the second column to be the target values y. To my understanding, a better approach would be to set X = reduced_data, since the sample data should consist of both PCA features, and y = y_digits, since the labels (targets) are unchanged by PCA.
(I also noticed you defined pca = PCA(n_components=10).fit_transform(data), but did not go on to use it, so I have removed it from the code in my answer).
As a result, you would have something like this:
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
from sklearn.svm import SVC
X_digits, y_digits = load_digits(return_X_y=True)
data = scale(X_digits)
# pca=PCA(n_components=10).fit_transform(data)
reduced_data = PCA(n_components=2).fit_transform(data)
clf = SVC(kernel='rbf', C=1e6)
clf.fit(reduced_data, y_digits)
I hope this has helped!
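For what it's worth, the same steps can also be chained in a pipeline, which makes it harder to mix up features and labels. This is only a sketch: StandardScaler stands in for scale(), the default C is used instead of C=1e6 to keep the fit quick, and the train/test split is an addition that was not in the question.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_digits, y_digits = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X_digits, y_digits, test_size=0.25, random_state=0
)

# Scale, project onto 2 principal components, then classify.
model = make_pipeline(StandardScaler(), PCA(n_components=2), SVC(kernel="rbf"))
model.fit(X_train, y_train)

# X stays 2D (samples x features) throughout; y is the original digit labels.
accuracy = model.score(X_test, y_test)
print(X_train.shape, accuracy)
```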

how to pass float argument in predict function in Python?

I was following a course on machine learning where the instructor passes a float argument to the predict function for polynomial linear regression, and it works for him. However, when I run the code it throws an error stating
"Expected 2D array, got scalar array instead".
I have tried to convert the scalar into an array but it does not seem to work.
# Polynomial Regression
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('Position_Salaries.csv')
X = dataset.iloc[:, 1:2].values
y = dataset.iloc[:, 2].values
# Splitting the dataset into the Training set and Test set
"""from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)"""
# Feature Scaling
"""from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)"""
# Fitting Linear Regression to the dataset
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X, y)
# Fitting Polynomial Regression to the dataset
from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures(degree = 4)
X_poly = poly_reg.fit_transform(X)
poly_reg.fit(X_poly, y)
lin_reg_2 = LinearRegression()
lin_reg_2.fit(X_poly, y)
# Predicting a new result with Linear Regression
lin_reg.predict(6.5)
The code seems to run smoothly for the instructor. However, I am getting the following error:
ValueError: Expected 2D array, got scalar array instead:
array=6.5.
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
This is the error that I am getting.
The predict function expects a 2D array as input, so you can put 6.5 inside double square brackets, like this: [[6.5]]
lin_reg.predict([[6.5]])
This will work.
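The same applies to the polynomial model: the new value goes through the polynomial transform as a 2D array before it is passed to predict. Below is a minimal sketch on made-up data where y is exactly x squared, so the result can be checked by hand; the variable names mirror the question but the data does not come from Position_Salaries.csv.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Made-up data: y is exactly x squared.
X = np.arange(1, 11, dtype=float).reshape(-1, 1)   # shape (10, 1)
y = X.ravel() ** 2

poly_reg = PolynomialFeatures(degree=2)
X_poly = poly_reg.fit_transform(X)                 # columns: 1, x, x^2
lin_reg_2 = LinearRegression().fit(X_poly, y)

# One sample with one feature, written as a 2D array.
new_X = np.array([[6.5]])
prediction = lin_reg_2.predict(poly_reg.transform(new_X))
print(prediction)
```

Since 6.5 squared is 42.25, the printed prediction should be very close to that value.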
Welcome to Stack Overflow! You're more likely to get your question answered if you provide a minimal reproducible example and show at least a portion of any required external files. In this case, I think I've boiled it down to the essentials:
import pandas as pd
# Importing the dataset
salaries = [('Junior', 1, 50000),
('Associate', 2, 60000),
('Senior', 3, 70000),
('Manager', 4, 80000)]
df = pd.DataFrame(salaries)
X = df.iloc[:, 1:2].values
y = df.iloc[:, 2].values
# Fitting Linear Regression to the dataset
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X, y)
# Predicting a new result with Linear Regression
print(lin_reg.predict(6.5))
Although I can't be sure exactly what is in Position_Salaries.csv, I assume based on the other arguments that it looks something like what I've shown. Running that example returns the expected result of 76100 in Python 3.6 with sklearn 0.19. If you still get an error, try updating sklearn:
pip install --upgrade scikit-learn
If you're still getting an error after that, I'm not sure where the difference is, but you can pass a 2D array explicitly like this: lin_reg.predict([[6.5]])

sklearn KNeighborsClassifier "ValueError: Found array with dim 4. Estimator expected <= 2."

I am trying to train a simple model with sklearn's KNeighborsClassifier on wine quality data. This is my code:
from sklearn.neighbors import KNeighborsClassifier
import pandas as pd
import numpy as np
dataframe = pd.read_csv("winequality-white.csv")
dataframe = dataframe.drop(["fixed acidity", "pH", "sulphates"], axis=1)
test = dataframe[110:128]
train = dataframe[15:40]
Y = train["quality"]
X = train.drop(["quality"], axis=1)
#print(X)
#print(Y)
knn = KNeighborsClassifier()
knn.fit(X, Y)
testvals = np.array(test.loc[110, :])
testvals = testvals.reshape(1, -1)
print(knn.predict([[testvals]]))
I get the error "ValueError: Found array with dim 4. Estimator expected <= 2."
I'm fairly certain it has something to do with the shape of my array and I have tried to reshape it, but had no luck. What should I do?
Consider the following (reproducible) example setup:
>>> import pandas as pd
>>> import numpy as np
>>> test = pd.DataFrame.from_records(data=np.random.rand(120, 4))
>>> testvals = np.array(test.loc[110, :])
The way you're reshaping your vector when you pass it to the predict function is creating an array with more than the expected 2 dims (i.e., a multidimensional array). Here's the output of your reshape that you're passing into the predict function:
>>> [[testvals.reshape((-1, 1))]]
[[array([[ 0.25174728],
[ 0.24603664],
[ 0.01781963],
[ 0.49317648]])]]
We can show this produces a 4-d array:
>>> np.asarray([[testvals.reshape((-1, 1))]]).ndim
4
Sklearn expects a 2D array. Here's how you can fix it. If you want to predict for the entire matrix, just run:
knn.predict(test)
If you just want to predict for one sample, you could do:
knn.predict([test.loc[110].tolist()])
By the way, it's worth mentioning you have still not popped the target off of test, so the number of features won't match until you do:
y_test = test.pop('quality')
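Putting these pieces together, here is a sketch on random data. The feature column names a to d, the row counts and the random quality labels are all made up; only the quality column name comes from the question.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Made-up stand-in for the wine data.
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.random((60, 4)), columns=["a", "b", "c", "d"])
data["quality"] = rng.integers(3, 9, size=60)

train = data.iloc[:40].copy()
test = data.iloc[40:].copy()

# Remove the target from both frames so the feature counts match.
y_train = train.pop("quality")
y_test = test.pop("quality")

knn = KNeighborsClassifier().fit(train, y_train)

all_preds = knn.predict(test)            # every test row at once
one_pred = knn.predict(test.iloc[[0]])   # one-row DataFrame: still 2D

# Reshaping and then wrapping in double brackets gives 4 dimensions.
bad = np.asarray([[test.iloc[0].to_numpy().reshape(1, -1)]])
print(all_preds.shape, one_pred.shape, bad.ndim)
```

Selecting with test.iloc[[0]] (a list) rather than test.iloc[0] keeps the row as a 2D frame, so no extra brackets or reshape are needed.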
