Shapes not aligned when fitting polynomial regression - python

Long time listener, first time caller...
I know a similar question has been answered in the past (see here for other thread I have referenced), but I am still having difficulties. How can I get my regression to fit? My code is below:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
#data
np.random.seed(0)
n = 15
x = np.linspace(0,10,n) + np.random.randn(n)/5
y = np.sin(x)+x/6 + np.random.randn(n)/10
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=0)
#regression fitting
X_predict_input = np.linspace(0,10,100).reshape(-1,1)
y_train = y_train.reshape((-1,1))
X_train = X_train.reshape((-1,1))
#looping through different degree values
for i, degree in enumerate([1,3,6,9]):
poly = PolynomialFeatures(degree=degree)
X_train_poly = poly.fit_transform(X_train)
linreg = LinearRegression().fit(X_train_poly, y_train)
result[i,:] = linreg.predict(X_predict_input)
I tried to fix the shaping issues with X_train and y_train, but after looking into each shape, I am thinking that the X_train_poly is what is driving this error...
X_train shape: (11, 1)
y_train shape: (11, 1)
X_train_poly shape: (11, 10)
Respective error message:
ValueError: shapes (100,1) and (2,1) not aligned: 1 (dim 1) != 2 (dim 0)
When I try to address the shape inconsistencies in X_train_poly by the following...
X_train_poly = poly.fit_transform(X_train).reshape((-1,1))
...I receive this error:
ValueError: Found input variables with inconsistent numbers of samples: [22, 11]
I have spent an embarrassing amount of time on this, so any insight at all would be greatly appreciated!
Thank you in advance :)

I think the problem is quite simple. You're using the PolynomialFeatures transform to generate features for the training data but when it comes to prediction, you're not applying the same transform to the input data.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
# data
np.random.seed(0)
n = 15
x = np.linspace(0, 10, n) + np.random.randn(n)/5
y = np.sin(x) + x/6 + np.random.randn(n)/10
X_train, X_test, y_train, y_test = train_test_split(x.reshape((-1, 1)),
y.reshape((-1, 1)),
random_state=0)
# Check data matrices are in columns
assert(X_train.shape == (11, 1))
assert(y_train.shape == (11, 1))
# Build library of polynomial features
degree = 3
poly = PolynomialFeatures(degree)
X_train_poly = poly.fit_transform(X_train)
assert(X_train_poly.shape == (11, 4))
# Fit model
linreg = LinearRegression().fit(X_train_poly, y_train)
# Make prediction
X_predict = np.linspace(0, 10, 100).reshape(-1, 1)
X_predict_poly = poly.fit_transform(X_predict)
y_predict = linreg.predict(X_predict_poly)
assert(y_predict.shape == X_predict.shape)
Update:
To avoid the inconvenience of having to apply the transform every time you make a prediction, you might want to check out sklearn.Pipeline:
# Using a pipeline to automate the input transformation
from sklearn.pipeline import Pipeline
poly = PolynomialFeatures(degree)
model = LinearRegression()
pipeline = Pipeline(steps=[('t', poly), ('m', model)])
linreg = pipeline.fit(X_train, y_train)
y_predict2 = linreg.predict(X_predict)
assert(np.array_equal(y_predict, y_predict2))

Related

Can anyone help resolve my polynomial regression model with feature scaling and transformations?

The error I got is:
ValueError: shapes (1,2) and (15,) not aligned: 2 (dim 1) != 15 (dim 0)
Code:
import numpy as np
import pandas as pd
dataset = pd.read_csv('music.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
poly_reg = PolynomialFeatures(degree = 4)
X_poly = poly_reg.fit_transform(X_train)
regressor = LinearRegression()
regressor.fit(X_poly, y_train)
print(y_train.inverse_transform((regressor.predict([[33,0]]))))
This is the full error:
print(y_test.inverse_transform((regressor.predict([[33,0]]))))
AttributeError: 'numpy.ndarray' object has no attribute 'inverse_transform'
Process finished with exit code 1
y_train is a numpy array, but you are treating it as a fitted encoder.
y_train.inverse_transform((regressor.predict([[33,0]]))) is your problem. It should be le.inverse_transform(regressor.predict([[33,0]])).

ValueError: Shape of passed values is (39, 1), indices imply (39, 7)

Heyy I am trying to do a simple logistic regression on my data which is returns (y) versus market indices (x).
import numpy as np
from sklearn import metrics
data = pd.read_excel ('Datafile.xlsx', index_col=0)
#split dataset into features and target variable
col_features = ['Market Beta','Value','Size','High-Yield Spread','Term Spread','Momentum','Bid-Ask Spread']
target=['Return']
x = data[col_features] #features
y = data[target] #target
#split x and y into training and testing datasets
from sklearn.model_selection import train_test_split
x_train, y_train, x_test, y_test = train_test_split (x, y, test_size = 0.25, random_state = 0)
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
y_train = np.argmax(y_train, axis=1)
logreg.fit(x_train, y_train)
y_pred = logreg.predict(x_test)
The error I get is
ValueError: Shape of passed values is (39, 1), indices imply (39, 7)
Thank you.
you just confused the order of train_test_split results so x_test and y_train became switched. Proper order should be this:
x_train, x_test, y_train, y_test = train_test_split(x, y, ...

Expected 2D array error not getting resolved

i am trying to use my machine learning model on dataset where i have only two columns while standard scaling them,i got the error expected 2D array but got 1 .
Below is the code:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('Position_Salaries.csv')
X = dataset.iloc[:, 1:2].values
y = dataset.iloc[:, 2].values
# Splitting the dataset into the Training set and Test set
"""from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)"""
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
sc_y = StandardScaler()
X = sc_X.fit_transform(X)
y = sc_y.fit_transform(y)
# Fitting SVR to the dataset
from sklearn.svm import SVR
regressor = SVR(kernel = 'rbf')
regressor.fit(X, y)
# Predicting a new result
y_pred = regressor.predict(6.5)
y_pred = sc_y.inverse_transform(y_pred)
# Visualising the SVR results
plt.scatter(X, y, color = 'red')
plt.plot(X, regressor.predict(X), color = 'blue')
plt.title('Truth or Bluff (SVR)')
plt.xlabel('Position level')
plt.ylabel('Salary')
plt.show()
when i try to put
y = sc_y.fit_transform([y])
like this i received no error but when i execute next 3 lines i receive another error.
which is bad input shape (1, 10)
can anyone help me on this?
The StandardScaler() function in sklearn expects the input(X) to be in the following format:
X : numpy array of shape [n_samples, n_features]
So, reshape X to (-1,1) if you have only one feature column.
sc_X.fit_transform(X.reshape[-1,1])
This should work!

Error: Shapes (1,4) and (14,14) not aligned

So I'm a newbie to Machine Learning and a little baffled by this error:
Shapes (1,4) and (14,14) not aligned: 4 (dim 1) != 14 (dim 0)
Here is the full error:
File "/Users/jim/anaconda3/lib/python3.6/site-packages/sklearn/utils/extmath.py", line 140, in safe_sparse_dot
return np.dot(a, b)
ValueError: shapes (1,4) and (14,14) not aligned: 4 (dim 1) != 14 (dim 0)
My test set has 4 rows of data and training set 14 rows of data, as indicated by (1,4) and (14,14). At least I think that's what that means.
I'm trying to fit a simple linear regression to a Training set as indicated by my code below:
# Fit Simple Linear Regression to Training Set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
X_train = X_train.reshape(1,-1)
y_train = y_train.reshape(1,-1)
regressor.fit(X_train, y_train)
Then predict the Test Set Results:
# Predicting the Test Set Results
X_test = X_test.reshape(1,-1)
y_pred = regressor.predict(X_test)
My code is failing on the last line with the above error:
y_pred = regressor.predict(X_test)
Any hints in the right direction would be great.
Here is my whole code sample:
# Simple Linear Regression
# Import Libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Import dataset
dataset = pd.read_csv('NBA.csv')
X = dataset.iloc[:, 1].values
y = dataset.iloc[:, :-1].values
# Splitting the dataset into Train and Test
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
# Feature Scaling
# None
# Fit Simple Linear Regression to Training Set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
X_train = X_train.reshape(1,-1)
y_train = y_train.reshape(1,-1)
regressor.fit(X_train, y_train)
# Predicting the Test Set Results
X_test = X_test.reshape(1,-1)
y_pred = regressor.predict(X_test)
** EDIT **
I checked the shape of X and y. Here is my output below:
dataset = pd.read_csv('NBA.csv')
X = dataset.iloc[:, 1].values
y = dataset.iloc[:, :-1].values
print(X.shape)
print(y.shape)
-->(18,)
-->(18, 1)
Please replace reshape(1,-1) to reshape(-1, 1) for all usages. The former transforms an array into (1 person x n features) and the latter does (n persons x 1 feature). feature is hight, in this case.
If you modified import section as below, there is no need to reshape the array since their shapes are already satisfy the form of (n persons x 1 feature).
# Import dataset
dataset = pd.read_csv('NBA.csv')
X = dataset.iloc[:, 1].values
y = dataset.iloc[:, 0].values
X = X.reshape(-1, 1)
y = y.reshape(-1, 1)
In an early age of the sklearn, you can feed vector as inputs. But recently it has changed and now you need to explicitly indicate whether the vector is (1 sample x n features) or (n samples x 1 feature) by using reshape or some other methods.

Regression with Python (nympy/pandas) [duplicate]

I have the following variables:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
np.random.seed(0)
n = 15
x = np.linspace(0,10,n) + np.random.randn(n)/5
y = np.sin(x)+x/6 + np.random.randn(n)/10
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=0)
def part1_scatter():
%matplotlib notebook
plt.figure()
plt.scatter(X_train, y_train, label='training data')
plt.scatter(X_test, y_test, label='test data')
plt.legend(loc=4);
And the following question:
Write a function that fits a polynomial LinearRegression model on the training data X_train for degrees 1, 3, 6, and 9. (Use PolynomialFeatures in sklearn.preprocessing to create the polynomial features and then fit a linear regression model) For each model, find 100 predicted values over the interval x = 0 to 10 (e.g. np.linspace(0,10,100)) and store this in a numpy array. The first row of this array should correspond to the output from the model trained on degree 1, the second row degree 3, the third row degree 6, and the fourth row degree 9.
This is my code, but it don't work out:
def answer_one():
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
np.random.seed(0)
n = 15
x = np.linspace(0,10,n) + np.random.randn(n)/5
y = np.sin(x)+x/6 + np.random.randn(n)/10
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=0)
results = []
pred_data = np.linspace(0,10,100)
degree = [1,3,6,9]
y_train1 = y_train.reshape(-1,1)
for i in degree:
poly = PolynomialFeatures(degree=i)
pred_poly1 = poly.fit_transform(pred_data[:,np.newaxis])
X_F1_poly = poly.fit_transform(X_train[:,np.newaxis])
linreg = LinearRegression().fit(X_F1_poly, y_train1)
pred = linreg.predict(pred_poly1)
results.append(pred)
dataArray = np.array(results).reshape(4, 100)
return dataArray
I receive this error:
line 58 for i
in degree: ^ IndentationError: unexpected
indent
Could you tell me where the problem is?
The return statement should be performed after the for is done, so it should be indented under the for, not further in.
At the start of your line
n = 15
You stopped with identing. So that part isn't recognized as the function. This can be solved by putting 4 spaces on all lines from n = 15 onwards.

Categories