I have the following variables:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
np.random.seed(0)
n = 15
x = np.linspace(0,10,n) + np.random.randn(n)/5
y = np.sin(x)+x/6 + np.random.randn(n)/10
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=0)
def part1_scatter():
    %matplotlib notebook
    plt.figure()
    plt.scatter(X_train, y_train, label='training data')
    plt.scatter(X_test, y_test, label='test data')
    plt.legend(loc=4);
And the following question:
Write a function that fits a polynomial LinearRegression model on the training data X_train for degrees 1, 3, 6, and 9. (Use PolynomialFeatures in sklearn.preprocessing to create the polynomial features and then fit a linear regression model) For each model, find 100 predicted values over the interval x = 0 to 10 (e.g. np.linspace(0,10,100)) and store this in a numpy array. The first row of this array should correspond to the output from the model trained on degree 1, the second row degree 3, the third row degree 6, and the fourth row degree 9.
This is my code, but it doesn't work:
def answer_one():
    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import PolynomialFeatures
    np.random.seed(0)
n = 15
x = np.linspace(0,10,n) + np.random.randn(n)/5
y = np.sin(x)+x/6 + np.random.randn(n)/10
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=0)
results = []
pred_data = np.linspace(0,10,100)
degree = [1,3,6,9]
y_train1 = y_train.reshape(-1,1)
    for i in degree:
        poly = PolynomialFeatures(degree=i)
        pred_poly1 = poly.fit_transform(pred_data[:,np.newaxis])
        X_F1_poly = poly.fit_transform(X_train[:,np.newaxis])
        linreg = LinearRegression().fit(X_F1_poly, y_train1)
        pred = linreg.predict(pred_poly1)
        results.append(pred)
            dataArray = np.array(results).reshape(4, 100)
            return dataArray
I receive this error:
line 58
    for i in degree:
    ^
IndentationError: unexpected indent
Could you tell me where the problem is?
The return statement should run after the for loop is done, so it should be indented to the same level as the for statement, not further in.
Starting at the line
n = 15
you stopped indenting, so from that point on the code isn't recognized as part of the function. This can be solved by putting 4 spaces in front of every line from n = 15 onwards.
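For reference, here is a sketch of the whole function with consistent indentation (same logic as your code, with the loop body indented under the for and the return placed after the loop):

def answer_one():
    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import PolynomialFeatures

    np.random.seed(0)
    n = 15
    x = np.linspace(0, 10, n) + np.random.randn(n)/5
    y = np.sin(x) + x/6 + np.random.randn(n)/10
    X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=0)

    pred_data = np.linspace(0, 10, 100)
    results = []
    for i in [1, 3, 6, 9]:
        poly = PolynomialFeatures(degree=i)
        # Transform both the training inputs and the prediction grid with the same features
        X_poly = poly.fit_transform(X_train[:, np.newaxis])
        pred_poly = poly.transform(pred_data[:, np.newaxis])
        linreg = LinearRegression().fit(X_poly, y_train.reshape(-1, 1))
        results.append(linreg.predict(pred_poly))
    # After the loop: stack the four rows of predictions into a (4, 100) array
    return np.array(results).reshape(4, 100)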
Why am I getting a negative SCORE even though I am using scoring = 'neg_mean_squared_error'?
I tried the following line, apparently taken from the source code:
neg_mean_squared_error_scorer = make_scorer(mean_squared_error, greater_is_better=False)
Source Code
However, it doesn't work, and I don't see the point of using it if we are supposed to use scoring = 'neg_mean_squared_error'.
Here is the code I used:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from math import sqrt
from sklearn.metrics import \
r2_score, get_scorer, make_scorer, mean_squared_error
from sklearn.linear_model import \
Lasso, Ridge, LassoCV,LinearRegression
from sklearn.preprocessing import \
StandardScaler, PolynomialFeatures
from sklearn.model_selection import \
KFold, RepeatedKFold, GridSearchCV, \
cross_validate, train_test_split
# Features
x1 = np.linspace(-20,20,100)
x1 = np.array(x1).reshape(-1,1)
x2 = pow(x1,2)
x3 = pow(x1,3)
x4 = pow(x1,4)
x5 = pow(x1,5)
# Parameters
beta_0 = 1.75
beta_1 = 5
beta_3 = 0.05
beta_5 = -10.3
eps_mu = 0 # epsilon mean
eps_sigma = sqrt(4) # epsilon standard deviation
eps_size = 100 # epsilon size
np.random.seed(1) # Fixing a seed
eps = np.random.normal(eps_mu, eps_sigma, eps_size)
eps = np.array(eps).reshape(-1,1)
y = beta_0 + beta_1*x1 + beta_3*x3 + beta_5*x5 + eps
data = np.concatenate((y,x1,x2,x3,x4,x5), axis = 1)
X = data[:,1:6]
y = data[:,0]
alphas_to_try = np.linspace(0.00000000000000000000000001,0.002,10) ######## To modify #######
scoring = 'neg_mean_squared_error'
#scoring = (mean_squared_error, greater_is_better=False)
scorer = get_scorer(scoring)
k = 5
cv = KFold(n_splits = k)
for train_index, test_index in cv.split(data):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    sc = StandardScaler()
    X_train_scaled = sc.fit_transform(X_train)
    X_test_scaled = sc.transform(X_test)
validation_scores = []
train_scores = []
results_list = []
test_scores = []
for curr_alpha in alphas_to_try:
    regmodel = Lasso(alpha = curr_alpha)
    results = cross_validate(
        regmodel, X, y, scoring=scoring, cv=cv,
        return_train_score = True)
    validation_scores.append(np.mean(results['test_score']))
    train_scores.append(np.mean(results['train_score']))
    results_list.append(results)
    regmodel.fit(X,y)
    y_pred = regmodel.predict(X_test)
    test_scores.append(scorer(regmodel, X_test, y_test))
chosen_alpha_id = np.argmax(validation_scores)
chosen_alpha = alphas_to_try[chosen_alpha_id]
max_validation_score = np.max(validation_scores)
test_score_at_chosen_alpha = test_scores[chosen_alpha_id]
print('chosen_alpha:', chosen_alpha)
print('max_validation_score:', max_validation_score)
print('test_score_at_chosen_alpha:', test_score_at_chosen_alpha)
plt.figure(figsize = (8,8))
sns.lineplot(y = validation_scores, x = alphas_to_try, label = 'validation_data')
sns.lineplot(y = train_scores, x = alphas_to_try, label = 'training_data')
plt.axvline(x=chosen_alpha, linestyle='--')
sns.lineplot(y = test_scores, x = alphas_to_try, label = 'test_data')
plt.xlabel('alpha_parameter')
plt.ylabel(scoring)
plt.title('LASSO Regularisation')
plt.legend()
plt.show()
Why is the code not working? Why am I getting negative scores?
Output:
What I am supposed to get:
I am supposed to get something like the screenshot above, but with MSE instead of R² on the y-axis.
As the name suggests, neg_mean_squared_error is the negative of the mean-squared-error, so negative scores is expected (in fact, it is positive scores that are impossible).
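If you want to report plain (positive) MSE values, you can simply negate the scores. A minimal sketch using the lists from your code:

# neg_mean_squared_error returns -MSE, so negate to recover the usual positive MSE
mse_validation = [-s for s in validation_scores]
mse_train = [-s for s in train_scores]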
As to the plots, there's a bigger problem. Your train and validation scores are obtained using cross_validate, and are fine. But your test scores are obtained by fitting the regressor to the entire X, y and then scoring that on X_test, y_test, a subset of the training set! So those scores are quite optimistically biased.
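One way to get unbiased test scores, as a sketch reusing the variables defined in your code (alphas_to_try, scoring, cv, scorer), is to hold out a test split before any cross-validation and only ever fit on the training portion:

# Hold out a test set first; cross-validate only on the training portion
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

validation_scores, train_scores, test_scores = [], [], []
for curr_alpha in alphas_to_try:
    regmodel = Lasso(alpha=curr_alpha)
    results = cross_validate(regmodel, X_tr, y_tr, scoring=scoring, cv=cv,
                             return_train_score=True)
    validation_scores.append(np.mean(results['test_score']))
    train_scores.append(np.mean(results['train_score']))
    regmodel.fit(X_tr, y_tr)                          # fit on training data only
    test_scores.append(scorer(regmodel, X_te, y_te))  # score on the held-out test set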
A quick check on the scale of the errors: you have a degree-5 polynomial with the original feature taking values between -20 and 20. So the target takes values on the order of 10^6, and so squared errors may be expected on the order of 10^12.
Long time listener, first time caller...
I know a similar question has been answered in the past (see here for the other thread I referenced), but I am still having difficulties. How can I get my regression to fit? My code is below:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
#data
np.random.seed(0)
n = 15
x = np.linspace(0,10,n) + np.random.randn(n)/5
y = np.sin(x)+x/6 + np.random.randn(n)/10
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=0)
#regression fitting
X_predict_input = np.linspace(0,10,100).reshape(-1,1)
y_train = y_train.reshape((-1,1))
X_train = X_train.reshape((-1,1))
#looping through different degree values
for i, degree in enumerate([1,3,6,9]):
    poly = PolynomialFeatures(degree=degree)
    X_train_poly = poly.fit_transform(X_train)
    linreg = LinearRegression().fit(X_train_poly, y_train)
    result[i,:] = linreg.predict(X_predict_input)
I tried to fix the shaping issues with X_train and y_train, but after looking into each shape, I am thinking that the X_train_poly is what is driving this error...
X_train shape: (11, 1)
y_train shape: (11, 1)
X_train_poly shape: (11, 10)
Respective error message:
ValueError: shapes (100,1) and (2,1) not aligned: 1 (dim 1) != 2 (dim 0)
When I try to address the shape inconsistencies in X_train_poly by the following...
X_train_poly = poly.fit_transform(X_train).reshape((-1,1))
...I receive this error:
ValueError: Found input variables with inconsistent numbers of samples: [22, 11]
I have spent an embarrassing amount of time on this, so any insight at all would be greatly appreciated!
Thank you in advance :)
I think the problem is quite simple. You're using the PolynomialFeatures transform to generate features for the training data but when it comes to prediction, you're not applying the same transform to the input data.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
# data
np.random.seed(0)
n = 15
x = np.linspace(0, 10, n) + np.random.randn(n)/5
y = np.sin(x) + x/6 + np.random.randn(n)/10
X_train, X_test, y_train, y_test = train_test_split(x.reshape((-1, 1)),
y.reshape((-1, 1)),
random_state=0)
# Check data matrices are in columns
assert(X_train.shape == (11, 1))
assert(y_train.shape == (11, 1))
# Build library of polynomial features
degree = 3
poly = PolynomialFeatures(degree)
X_train_poly = poly.fit_transform(X_train)
assert(X_train_poly.shape == (11, 4))
# Fit model
linreg = LinearRegression().fit(X_train_poly, y_train)
# Make prediction
X_predict = np.linspace(0, 10, 100).reshape(-1, 1)
X_predict_poly = poly.fit_transform(X_predict)
y_predict = linreg.predict(X_predict_poly)
assert(y_predict.shape == X_predict.shape)
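As a side note on the code above: since poly has already been fitted on X_train, calling transform alone at prediction time is enough (fit_transform happens to give the same result here because PolynomialFeatures only records the number of input features when fitting):

# poly was fitted on X_train above, so transform() is sufficient for new inputs
X_predict_poly = poly.transform(X_predict)
y_predict = linreg.predict(X_predict_poly)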
Update:
To avoid the inconvenience of having to apply the transform every time you make a prediction, you might want to check out sklearn.pipeline.Pipeline:
# Using a pipeline to automate the input transformation
from sklearn.pipeline import Pipeline
poly = PolynomialFeatures(degree)
model = LinearRegression()
pipeline = Pipeline(steps=[('t', poly), ('m', model)])
linreg = pipeline.fit(X_train, y_train)
y_predict2 = linreg.predict(X_predict)
assert(np.array_equal(y_predict, y_predict2))
import numpy as np
import pandas as pd
import sklearn
from sklearn import linear_model
from sklearn.utils import shuffle
import matplotlib.pyplot as pyplot
import pickle
from matplotlib import style
data = pd.read_csv("student-mat.csv", sep=";")
data = data[["G1", "G3", "G3", "studytime", "failures", "absences", "freetime"]]
predict = "G3"
X = np.array(data.drop([predict], 1))
Y = np.array(data[predict])
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(X, Y, test_size = 0.1)
best = 0
for _ in range(3000):
    x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(X, Y, test_size=0.1)
    linear = linear_model.LinearRegression()
    linear.fit(x_train, y_train)
    acc = linear.score(x_test, y_test)
    print(acc)
    if acc > best:
        best = acc
        with open("studentmodel.pickle", "wb") as f:
            pickle.dump(linear, f)
pickle_in = open("studentmodel.pickle", "rb")
linear = pickle.load(pickle_in)
print('Co: \n', linear.coef_)
print('Intercept: \n', linear.intercept_)
predictions = linear.predict(x_test)
for x in range(len(predictions)):
    print(predictions[x], x_test[x], y_test[x])
p = 'G1'
style.use("ggplot")
pyplot.scatter(data[p],data["G3"])
pyplot.xlabel(p)
pyplot.ylabel("Final Grade")
pyplot.show()
Error: raise ValueError ("X and y must be the same size")
Can anyone please explain what I have done wrong? I am new to programming and was following a tutorial; everything up to the last 5 lines was working fine, but when I try to make a graph it gives me the error "raise ValueError ("X and y must be the same size")". It only allows me to make a graph if I write the code like this:
style.use("ggplot")
pyplot.scatter(data["G3"],data["G3"])
pyplot.xlabel(p)
pyplot.ylabel("Final Grade")
pyplot.show()
That only gives me a straight line on the graph.
Thank you for any help!
I have run the following code using this data.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from matplotlib import pyplot as plt
import pickle
from matplotlib import style
data = pd.read_csv("student-mat.csv")
# Here, I have changed the columns because "G3" was occurring twice.
data = data[["G1", "G2", "G3", "studytime", "failures", "absences", "freetime"]]
predict = "G3"
print(data.head())
X = np.array(data.drop([predict], 1))
print(X)
y = np.array(data[predict])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)
best = 0
for _ in range(3000):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)
    linear = LinearRegression()
    linear.fit(X_train, y_train)
    acc = linear.score(X_test, y_test)
    print(acc)
    if acc > best:
        best = acc
        with open("studentmodel.pickle", "wb") as f:
            pickle.dump(linear, f)
pickle_in = open("studentmodel.pickle", "rb")
linear = pickle.load(pickle_in)
print('Co: \n', linear.coef_)
print('Intercept: \n', linear.intercept_)
predictions = linear.predict(X_test)
for x in range(len(predictions)):
    print(predictions[x], X_test[x], y_test[x])
p = 'G1'
style.use("ggplot")
plt.scatter(data[p], data["G3"])
plt.xlabel(p)
plt.ylabel("Final Grade")
plt.show()
This will produce the following image.
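The underlying cause of the original error, as far as I can tell, is the duplicated column label: selecting data[["G1", "G3", "G3", ...]] keeps two "G3" columns, so data["G3"] returns a two-column DataFrame rather than a Series, and pyplot.scatter then sees x and y with different sizes. A small sketch of that behaviour:

import pandas as pd
df = pd.DataFrame({"G1": [1, 2, 3], "G3": [4, 5, 6]})
dup = df[["G1", "G3", "G3"]]   # "G3" listed twice
print(dup["G3"].shape)         # (3, 2): a DataFrame holding both "G3" columns
# pyplot.scatter(dup["G1"], dup["G3"]) therefore fails:
# ValueError: x and y must be the same size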
This is the custom code
#Custom model for multiple linear regression
import numpy as np
import pandas as pd
dataset = pd.read_csv("50s.csv")
x = dataset.iloc[:,:-1].values
y = dataset.iloc[:,4:5].values
from sklearn.preprocessing import LabelEncoder
lb = LabelEncoder()
x[:,3] = lb.fit_transform(x[:,3])
from sklearn.preprocessing import OneHotEncoder
on = OneHotEncoder(categorical_features=[3])
x = on.fit_transform(x).toarray()
x = x[:,1:]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=1/5, random_state=0)
con = np.matrix(X_train)
z = np.matrix(y_train)
#training model
result1 = con.transpose()*con
result1 = np.linalg.inv(result1)
p = con.transpose()*z
f = result1*p
l = []
for i in range(len(X_test)):
    temp = f[0]*X_test[i][0] + f[1]*X_test[i][1] + f[2]*X_test[i][2] + f[3]*X_test[i][3] + f[4]*X_test[i][4]
    l.append(temp)
import matplotlib.pyplot as plt
plt.scatter(y_test,l)
plt.show()
Then I created a model with scikit-learn and compared its results with y_test and l (the predicted values from the code above). The comparisons are as follows:
for i in range(len(prediction)):
    print(y_test[i], prediction[i], l[i], sep=' ')
103282.38 103015.20159795816 [[116862.44205399]]
144259.4 132582.27760816005 [[118661.40080974]]
146121.95 132447.73845175043 [[124952.97891882]]
77798.83 71976.09851258533 [[60680.01036438]]
These are the comparisons between y_test, the scikit-learn model predictions, and the custom-code predictions. Please help me with the accuracy of the model.
Blue: custom model predictions
Yellow: scikit-learn model predictions
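A likely source of the gap (an assumption, not verified against this dataset): the closed-form solution above computes f = (X^T X)^-1 X^T y without an intercept column, while scikit-learn's LinearRegression fits an intercept by default. A minimal sketch that adds a column of ones:

# Sketch: normal equation with an explicit intercept (bias) column
ones = np.ones((X_train.shape[0], 1))
Xb = np.hstack([ones, X_train])                    # prepend a column of ones
f = np.linalg.inv(Xb.T @ Xb) @ (Xb.T @ y_train)    # f[0] is the intercept
Xb_test = np.hstack([np.ones((X_test.shape[0], 1)), X_test])
l = (Xb_test @ f).ravel()                          # predictions comparable to sklearn's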
I am trying to use my machine learning model on a dataset where I have only two columns. While standard scaling them, I got the error "Expected 2D array, got 1D array instead".
Below is the code:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('Position_Salaries.csv')
X = dataset.iloc[:, 1:2].values
y = dataset.iloc[:, 2].values
# Splitting the dataset into the Training set and Test set
"""from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)"""
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
sc_y = StandardScaler()
X = sc_X.fit_transform(X)
y = sc_y.fit_transform(y)
# Fitting SVR to the dataset
from sklearn.svm import SVR
regressor = SVR(kernel = 'rbf')
regressor.fit(X, y)
# Predicting a new result
y_pred = regressor.predict(6.5)
y_pred = sc_y.inverse_transform(y_pred)
# Visualising the SVR results
plt.scatter(X, y, color = 'red')
plt.plot(X, regressor.predict(X), color = 'blue')
plt.title('Truth or Bluff (SVR)')
plt.xlabel('Position level')
plt.ylabel('Salary')
plt.show()
When I try
y = sc_y.fit_transform([y])
instead, I receive no error, but when I execute the next 3 lines I get another error: bad input shape (1, 10).
Can anyone help me with this?
The StandardScaler() function in sklearn expects the input(X) to be in the following format:
X : numpy array of shape [n_samples, n_features]
So, reshape X to (-1,1) if you have only one feature column.
sc_X.fit_transform(X.reshape(-1, 1))
This should work!
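In the snippet above it is y (taken with dataset.iloc[:, 2], so one-dimensional) that triggers the error, since X from iloc[:, 1:2] is already 2D. A sketch of the scaling step under that reading:

y = dataset.iloc[:, 2].values
y = sc_y.fit_transform(y.reshape(-1, 1))  # StandardScaler needs a 2D (n_samples, 1) input
y = y.ravel()                             # flatten back to 1D before fitting the SVR
regressor.fit(X, y)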