I was following a machine learning course in which the instructor passes a float argument to the predict function of a polynomial linear regression model, and it works for him. However, when I run the same code it throws an error stating
"Expected 2D array, got scalar array instead".
I have tried converting the scalar into an array, but that does not seem to work.
# Polynomial Regression
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('Position_Salaries.csv')
X = dataset.iloc[:, 1:2].values
y = dataset.iloc[:, 2].values
# Splitting the dataset into the Training set and Test set
"""from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)"""
# Feature Scaling
"""from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)"""
# Fitting Linear Regression to the dataset
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X, y)
# Fitting Polynomial Regression to the dataset
from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures(degree = 4)
X_poly = poly_reg.fit_transform(X)
poly_reg.fit(X_poly, y)
lin_reg_2 = LinearRegression()
lin_reg_2.fit(X_poly, y)
# Predicting a new result with Linear Regression
lin_reg.predict(6.5)
The code seems to run smoothly for the instructor. However, I am getting the following error:
ValueError: Expected 2D array, got scalar array instead:
array=6.5.
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
This is the error that I am getting.
Actually, the predict function expects a 2D array as input, so you can wrap 6.5 in double square brackets, like this: [[6.5]]
lin_reg.predict([[6.5]])
This will work.
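A minimal sketch of the shape requirement, using made-up numbers rather than the real Position_Salaries.csv (whose contents are not shown here). predict expects one row per sample and one column per feature, so a single value is written as [[6.5]]. For the polynomial model, the new value presumably needs to go through the same PolynomialFeatures transform before it is passed to predict.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Made-up data for illustration only (not the real Position_Salaries.csv)
X = np.arange(1, 11).reshape(-1, 1)   # shape (10, 1): 10 samples, 1 feature
y = 1000.0 * X.ravel() ** 2           # shape (10,)

# Plain linear model
lin_reg = LinearRegression().fit(X, y)

# Polynomial model: expand the features, then fit a linear model on them
poly_reg = PolynomialFeatures(degree=4)
X_poly = poly_reg.fit_transform(X)
lin_reg_2 = LinearRegression().fit(X_poly, y)

new_X = [[6.5]]                       # one sample, one feature: a 2D input
lin_pred = lin_reg.predict(new_X)
poly_pred = lin_reg_2.predict(poly_reg.transform(new_X))
print(lin_pred, poly_pred)
```

Both calls return a 1D array with one prediction per input row, so predicting several values at once is a matter of adding rows, for example [[6.5], [7.0]].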
Welcome to Stack Overflow! You're more likely to get your question answered if you provide a minimal reproducible example and show at least a portion of any required external files. In this case, I think I've boiled it down to the essentials:
import pandas as pd
# Importing the dataset
salaries = [('Junior', 1, 50000),
('Associate', 2, 60000),
('Senior', 3, 70000),
('Manager', 4, 80000)]
df = pd.DataFrame(salaries)
X = df.iloc[:, 1:2].values
y = df.iloc[:, 2].values
# Fitting Linear Regression to the dataset
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X, y)
# Predicting a new result with Linear Regression
print(lin_reg.predict(6.5))
Although I can't be sure exactly what is in Position_Salaries.csv, I assume based on the other arguments that it looks something like what I've shown. Running that example returns the expected result of 76100 in Python 3.6 with sklearn 0.19. If you still get an error, try updating sklearn:
pip install --upgrade scikit-learn
If you're still getting an error after that, I'm not sure where the difference is, but you can supply a 2D array by passing the argument like this: lin_reg.predict([[6.5]])
#splitting the dataset into dependent(y) and independent variable(x)
x = training_data.iloc[:,[0,2,3,4,5,6,7]].values
y = training_data.iloc[:,1].values
from sklearn.model_selection import train_test_split
x_train,y_train,x_test,y_test = train_test_split(x,y,test_size = 0.3,random_state = 0)
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(x_train,y_train)
I am trying to use logistic regression to fit the independent variables (x_train) and the dependent variable (y_train), but every time I run the code I see this error:
ValueError: y should be a 1d array, got an array of shape (295, 7) instead.
I don't know what to do.
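You have an error in the train_test_split call.
Pay attention to the order of the returned values. The function returns the two X splits first and then the two y splits, so in your code y_train actually received the test portion of x (hence the shape (295, 7)). The correct order is:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)
With that one line changed, your problem should disappear.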
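As a quick check of the return order, here is a small sketch with made-up arrays (the shapes are arbitrary and only for illustration). Printing the shapes right after the split makes a swapped assignment easy to spot.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 20 samples with 2 features, and a 1D target
x = np.arange(40).reshape(20, 2)
y = np.arange(20) % 2

# Order of the returned values: X train, X test, y train, y test
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0
)

# The X splits stay 2D and the y splits stay 1D
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)
```

If y_train comes out 2D here, the names on the left-hand side are most likely in the wrong order.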
I am new to this, so anything will be helpful. The data size is large...
I am not sure where the error could be coming from. I don't even know if this is a good idea, haha; I am using longitude and latitude for my X and y.
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import pandas as pd
import numpy as np
df = pd.read_csv('aug.csv')
X = df.Lon
y = df.Lat
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
clf = DecisionTreeClassifier()
clf = clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)
ValueError: Expected 2D array, got 1D array instead:
array=[-73.9713 -74.0635 -73.9881 ... -74.1777 -73.9923 -73.9661].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
Your X variable for the inputs needs to be a 2D array of features. df.Lon selects a single column as a pandas Series, which is interpreted as a 1D array. The error message you are getting is correct, so change the line "X = df.Lon" to:
"X = df.Lon.values.reshape(-1, 1)"
One thing to note: what you're doing doesn't make a ton of sense. What this code is trying to do is predict the Y (lat) given the X (lon). These really should be independent variables, so predicting one from the other will probably not yield any meaningful results.
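For reference, a small sketch of two ways to get a 2D feature array from a single DataFrame column. The coordinates below are made up and stand in for the contents of aug.csv, which is not shown.

```python
import pandas as pd

# Made-up coordinates standing in for aug.csv
df = pd.DataFrame({
    "Lon": [-73.97, -74.06, -73.99],
    "Lat": [40.75, 40.71, 40.73],
})

X_1d = df.Lon                        # Series: 1D, rejected by scikit-learn estimators
X_frame = df[["Lon"]]                # double brackets keep a 2D DataFrame
X_array = df.Lon.values.reshape(-1, 1)   # NumPy array reshaped to one column

print(X_1d.shape, X_frame.shape, X_array.shape)
```

Either 2D form can be passed as X; the target y can stay one-dimensional.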
So I have this small dataset and I want to perform multiple linear regression on it.
First I dropped the deliveries column because of its high correlation with miles. Although gasprice is supposed to be removed, I kept it so that I can perform multiple linear regression rather than simple linear regression.
Finally I removed the outliers and did the following:
Dataset
import math
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.stats import diagnostic as diag
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn import linear_model
%matplotlib inline
X = dfafter
Y = dfafter[['hours']]
# Split X and y into X_
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=1)
# create a Linear Regression model object
regression_model = LinearRegression()
# pass through the X_train & y_train data set
regression_model.fit(X_train, y_train)
y_predict = regression_model.predict(X_train)
# let's find the coefficients of the multiple linear regression and also the intercept
intercept = regression_model.intercept_[0]
coefficent = regression_model.coef_[0][0]
print("The intercept for our model is {}".format(intercept))
print('-'*100)
# loop through the dictionary and print the data
for coef in zip(X.columns, regression_model.coef_[0]):
    print("The Coefficient for {} is {}".format(coef[0],coef[1]))
#Coeffs here don't match the ones that will appear later
#Rebuild the model using Statsmodel for easier analysis
X2 = sm.add_constant(X)
# create a OLS model
model = sm.OLS(Y, X2)
# fit the data
est = model.fit()
# calculate the mean squared error
model_mse = mean_squared_error(y_train, y_predict)
# calculate the mean absolute error
model_mae = mean_absolute_error(y_train, y_predict)
# calculate the root mean squared error
model_rmse = math.sqrt(model_mse)
# display the output
print("MSE {:.3}".format(model_mse))
print("MAE {:.3}".format(model_mae))
print("RMSE {:.3}".format(model_rmse))
print(est.summary())
#????????? something is wrong
X = df[['miles', 'gasprice']]
y = df['hours']
regr = linear_model.LinearRegression()
regr.fit(X, y)
print(regr.coef_)
So the code ends here. I get different coefficients each time I print them. What did I do wrong, and are any of them correct?
I see you are trying 3 different things here, so let me summarize:
1. sklearn.linear_model.LinearRegression() with train_test_split(X, Y, test_size=0.2, random_state=1), so only using 80% of the data (but the split should be the same every time you run it, since you fixed the random state)
2. statsmodels.api.OLS with the full dataset (you're passing X2 and Y, which are not split into train and test sets)
3. sklearn.linear_model.LinearRegression() with the full dataset, as in #2.
I tried to reproduce with the iris dataset, and I am getting identical results for cases #2 and #3 (which are trained on the same exact data), and only slightly different coefficients for case 1.
In order to evaluate if any of them are "correct", you will need to evaluate the model on unseen data and look at adjusted R^2 score, etc (hence you need the holdout (test) set). If you want to further improve the model you can try to understand better the interactions of the features in the linear model. Statsmodels has a neat "R-like" formula way to specify your model: https://www.statsmodels.org/dev/example_formulas.html
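A rough sketch of the comparison described above, on synthetic data (the two columns stand in for miles and gasprice, and the true coefficients 3.0 and -2.0 are invented for the illustration). It fits one model on all rows and one on the 80% training split, then scores the second on the held-out rows.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real dataset (assumption: two numeric features)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

full_model = LinearRegression().fit(X, y)                # trained on every row
split_model = LinearRegression().fit(X_train, y_train)   # trained on 80% of the rows

# Different training rows give slightly different coefficients
print(full_model.coef_, split_model.coef_)

# Score on rows the split model has not seen
test_r2 = r2_score(y_test, split_model.predict(X_test))
print(test_r2)
```

Neither set of coefficients is wrong; they simply come from different training data, which is why the held-out score is the more useful thing to compare.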
I am trying to learn Python and data science from scratch using online material.
I have just tried to create a simple linear regression model to get some hands-on practice after reading a lot of material. However, I get the following error while trying to do it.
Could you kindly help me understand this error and see what I have done wrong?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from numpy.random import randn
np.random.seed(101)
df3=pd.DataFrame(randn(5,2),index ='0 1 2 3 4'.split(), columns='Test Price'.split())
y= df3['Price']
x= df3['Test']
import sklearn.model_selection as model_selection
X_train, X_test, y_train, y_test = model_selection.train_test_split(x, y, test_size=0.2, random_state=101)
from sklearn.linear_model import LinearRegression
lm2= LinearRegression()
lm2.fit(X_train,y_train)
Error
ValueError: Expected 2D array, got 1D array instead:
array=[-2.01816824 0.65111795 0.90796945 -0.84807698].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
Check the documentation for LinearRegression.fit:
Parameters
X: {array-like, sparse matrix} of shape (n_samples, n_features)
Training data
So you will have to reshape your X to (n_samples, 1) in your case.
Use
lm2.fit(X_train.values.reshape(-1,1),y_train)
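An alternative sketch, assuming the same df3 as in the question: selecting the feature with double brackets before the split keeps X two-dimensional, so both the training and the test portions can be passed to fit and predict without reshaping.

```python
import numpy as np
import pandas as pd
from numpy.random import randn
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

np.random.seed(101)
df3 = pd.DataFrame(randn(5, 2), index='0 1 2 3 4'.split(), columns='Test Price'.split())

x = df3[['Test']]   # DataFrame with one column: shape (5, 1)
y = df3['Price']    # Series: shape (5,)

X_train, X_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=101
)

lm2 = LinearRegression()
lm2.fit(X_train, y_train)          # X_train is already 2D
predictions = lm2.predict(X_test)  # X_test is 2D as well
print(X_train.shape, predictions.shape)
```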
I'm teaching myself some more tricks with python and scikit, and I'm trying to plot a linear regression model. My code can be seen below. But my program and console give the following error: x and y must be the same size. Additionally, my program makes it to the end of my code, but nothing gets plotted.
To fix the size error, the first thing that came to mind was testing the length of x and y with something like len(x) == len(y). But as far as I can tell, my data seems to be the same length. Maybe the error is referring to something other than length (if so, I'm not sure what). Would really appreciate any help.
from sklearn import cross_validation
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn import linear_model
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Create linear regression object
regr = linear_model.LinearRegression()
#load csv file with pandas
df = pd.read_csv("pokemon.csv")
#remove all string columns
df = df.drop(['Name','Type_1','Type_2','isLegendary','Color','Pr_Male','hasGender','Egg_Group_1','Egg_Group_2','hasMegaEvolution','Body_Style'], axis=1)
y= df.Catch_Rate
x_train, x_test, y_train, y_test = cross_validation.train_test_split(df, y, test_size=0.25, random_state=0)
# Train the model using the training sets
regr.fit(x_train, y_train)
# Make predictions using the testing set
pokemon_y_pred = regr.predict(x_test)
print (pokemon_y_pred)
# Plot outputs
plt.title("Linear Regression Model of Catch Rate")
plt.scatter(x_test, y_test, color='black')
plt.plot(x_test, pokemon_y_pred, color='blue', linewidth=3)
plt.xticks(())
plt.yticks(())
plt.show()
This is referring to the fact that your x-variable has more than one dimension; plot and scatter only work for 2D plots, and it seems that your x_test has multiple features while y_test and pokemon_y_pred are one-dimensional.
This error occurs when there is more than one x value for each y value: x_test has more columns than y_test, which is why the sizes do not match.
A 2D plot needs exactly one x for each y.
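Two possible ways around the size mismatch, sketched on synthetic data (the three random columns are stand-ins for the numeric Pokemon columns, which are not available here): plot one feature column at a time against the target, or plot the predicted values against the actual ones.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 samples, 3 features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
regr = LinearRegression().fit(x_train, y_train)
y_pred = regr.predict(x_test)

# x_test is (25, 3) while y_test is (25,), so scatter(x_test, y_test) cannot work
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4))

# Option 1: a single feature column against the target
ax1.scatter(x_test[:, 0], y_test, color="black")
ax1.set_xlabel("feature 0")
ax1.set_ylabel("target")

# Option 2: predicted against actual values, which works for any number of features
ax2.scatter(y_test, y_pred, color="blue")
ax2.set_xlabel("actual")
ax2.set_ylabel("predicted")

fig.tight_layout()
```

For interactive use, drop the matplotlib.use("Agg") line and finish with plt.show().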